The divergent evolution of protein sequences from genomic databases can be
analyzed  using  different  mathematical  models.  The  most  common  treat  all  sites  in  a 
protein  sequence  as  equally  variable.  More  sophisticated  models  acknowledge  the  fact 
that  purifying  selection  generally  tolerates  variable  amounts  of  amino  acid  replacement  at 
different  positions  in  a  protein  sequence.  In  their  "stationary"  versions,  such  models 
assume  that  the  replacement  rate  at  individual  positions  remains  constant  throughout 
evolutionary  history.  "Non-stationary"  covarion  versions,  however,  allow  the 
replacement  rate  at  a  position  to  vary  in  different  branches  of  the  evolutionary  tree. 
Recently,  statistical  methods  have  been  developed  that  highlight  this  type  of  variation  in 
replacement  rates.  Here,  we  show  how  positions  that  have  variable  rates  of  divergence  in 
different  regions  of  a  tree  ("covarion  behavior"),  coupled  with  analyses  of  experimental 
three-dimensional  structures,  can  provide  experimentally  testable  hypotheses  that  relate 
individual amino acid residues to specific functional differences in those branches. We
illustrate  this  in  the  elongation  factor  family  of  proteins  using  various  statistical 
inferences.  The  recent  crystal  structure  of  eukaryotic  elongation  factor,  bound  to  its 
nucleotide  exchange  factor,  demonstrates  the  predictive  powers  associated  with 
incorporating  the  covarion  model  into  comparative  evolutionary  analyses. 

In  addition,  based  on  previous  work  in  this  laboratory,  we  show  that  incorporating 
higher order models of sequence evolution leads to reconstructed ancestral
character states that differ from those predicted by the less sophisticated parsimony
method.  We  also  lay  the  foundation  for  determining  whether  the  ancestral  organism  to 
all  extant  bacteria  lived  in  a  cold-  or  hot-temperature  environment.  In  conclusion,  we 
advocate  the  use  of  higher  order  models  in  comparative  evolutionary  analyses  especially 
as  the  community  attempts  to  predict  protein  function  from  the  large  amount  of  genomic 
data  currently  being  produced  by  sequencing  projects. 

CHAPTER 1

DETERMINING  THE  PATHS  OF  DIVERGENT  EVOLUTION  THROUGH 

RECONSTRUCTED  ANCESTRAL  PROTEINS 


Reconstructing  the  Divergent  Evolution  of  Elongation  Factors 

Background  on  the  Early  Evolution  of  Life  on  Earth 

From  4.5  billion  to  3.8  billion  years  ago,  Earth  endured  a  heavy  bombardment  from 
meteors.  These  impacts  often  had  devastating  results.  It  is  estimated  that  some  of  the 
impactors  were  at  least  500  kilometers  in  diameter.  Impacts  from  such  large  objects 
would  create  a  superheated  atmosphere  of  vaporized  rock,  which  would  in  turn  have 
vaporized the oceans and sterilized the surface of the planet (Sleep et al., 1989). Since
the  first  signs  of  life  appear  not  long  after  the  bombardments  stopped  (Schopf,  1993), 
researchers  have  used  the  information  to  form  two  hypotheses.  First,  the  only  organisms 
to  survive  such  conditions  must  have  lived  deep  in  the  Earth's  crust  where  temperatures 
were high (Sleep et al., 1989). This implies that these very early organisms were
thermophiles. Alternatively, life may have arisen repeatedly, only to be extinguished by
successive sterilizing impacts (Maher and Stevenson, 1988). This implies that these very
early organisms need not have been thermophiles, but could also have been mesophiles.

Meteorite  blasts  were  just  one  problem  facing  early  life.  If,  as  proposed  by  Kasting 
(1997),  energy  coming  to  Earth  from  the  Sun  was  30%  less  during  our  planet's  early  years 
than  today,  the  Earth's  surface  should  have  been  largely  frozen,  at  least  until  the  Sun 
brightened  about  2  billion  years  ago.  This  has  led  researchers  to  two  more  ideas.  First, 
early  life  may  have  lived  in  water  just  beneath  the  ice  surface.  Alternatively,  early  life 
may  not  have  survived  the  freeze,  except  by  retreating  to  thermal  vents  deep  in  the  ocean 
(Gaidos et al., 1999).


Until 25 years ago, microbiology could not even begin to help analyze this problem,
simply because the true relationships among bacteria were not correctly understood.
Starting in the mid-1970s, Carl Woese (Woese and Fox, 1977) began to piece together
what  seemed,  at  first,  to  be  a  clear  and  coherent  picture  of  bacterial  evolution.  This  was 
done  by  examining  rRNA  sequences  using  phylogenetic  methods. 

rRNA  sequences  remain  one  of  the  most  useful  and  most  used  of  the  molecular 
chronometers.  They  are  present  in  all  organisms  in  homologous  forms.  Different 
positions  in  their  sequences  change  at  different  rates,  allowing  a  researcher  to  not  only 
use  rRNA  to  determine  close  relationships,  but  also  to  determine  distant  relationships  as 
well.  Most  importantly,  rRNAs  can  be  sequenced  directly  and  rapidly  by  means  of 
reverse  transcriptase  (Lane  et  al.,  1985). 

From  16S  rRNA  sequences,  a  "Universal  Tree"  was  built  (Figure  1-1).  In  this  tree,  life 
on  Earth  was  divided  into  three  distinct  domains.  Prokaryotes,  formerly  thought  to  be 
one  "kingdom",  were  divided  by  the  16S  rRNA  sequences  into  two  separate  lineages, 
eubacteria  (now  called  bacteria)  and  archaebacteria  (now  called  archaea).  The  most 
ancient  point  in  the  tree  was  termed  the  "last  common  ancestor,"  or  LCA.  Figure  1-2 
shows  the  bacterial  topology  based  on  rRNAs  from  cultured  and  non-cultured  strains 
(Pace,  1997). 

Finding  the  oldest  point  (the  "root")  on  the  tree  is  not  trivial,  as  the  sequences 
themselves  do  not  necessarily  contain  information  about  geological  time.  If,  however,  the 
LCA  contained  a  pair  of  paralogous  genes  that  were  created  by  duplication  before  the 
divergence of the three domains, and if descendants of both paralogs survive in
modern  representatives  of  each  domain,  then  these  sequences  may  be  used  to  root  the 
universal  tree.  Several  proteins  have  been  used  to  root  the  tree.  Gogarten  et  al.  (1989) 
used  duplicated  domains  in  H+  ATPases,  Brown  and  Doolittle  (1995)  used  aminoacyl- 
tRNA  synthetase  duplications,  Baldauf  et  al.  (1996)  used  elongation  factors  Tu  and  G, 
while Gribaldo and Cammarano (1998) used duplications in the signal recognition
particles to root the tree. All these studies concluded that the root of the Universal Tree
lies on the branch separating bacteria from the archaea/eukaryotic bifurcation.

The  Universal  Tree  was  used  to  draw  inferences  about  features  of  the  environment  of 
ancestral  organisms,  in  particular,  about  the  temperature  environment  of  the  LCA  at  the 
root of the tree. These were, for the most part, based on a combination of ideas derived
from parsimony, and the notion that ancestors are more like the descendants to which they
are connected by the shortest branch. For example, Aquifex pyrophilus and Thermotoga
maritima are both thermophiles. They both occupy branches that diverge near the base of
the bacterial lineage (Burggraf et al., 1992). Therefore, it was concluded that the
organisms at the base of the tree were thermophiles. Similarly, the most deeply branching
archaeal sequences seemed to be thermophiles (Woese, 1987). Therefore, a compelling
case could be made that the LCA was a heat-lover.

Figure 1-1. Universal Tree of Life based on 16S rRNA sequence data. Adapted from

Figure 1-2. Bacterial phylogeny. Adapted from Pace (1997).

Again,  the  idea  that  the  LCA  was  thermophilic  comes  from  the  rRNA  tree  built  by 
Woese  and  others.  However,  what  if  the  rRNA  tree  was  incorrect  or  not  fully  resolved? 
This is what some researchers are concluding. DeLong et al. (1994) discovered that there
are many mesophilic archaea that branch early within the archaeal domain. Although
the branches leading to these mesophiles are longer than those of their thermophilic
counterparts, it can be argued that thermophilic organisms have a high GC content due to
structural constraints, which would slow down their evolutionary rate. This decreased
evolutionary rate can result in short branches. Further, Galtier et al. (1999) used rRNA to
analyze the GC content of the last common ancestor. Based on a maximum likelihood model
that allows for variable substitution rates among lineages and assumes that the GC mutation
rate has not reached equilibrium, these researchers showed that the LCA did not have a high
GC content, as appears to be necessary for thermophilicity.

Over  the  past  few  years,  a  number  of  articles  have  been  published  that  call  into 
question past interpretations. These articles question the assumption that thermophiles
on branches near the base of the tree necessarily imply that the species at the base was a
thermophile, point out that analyzing different genes results in different topologies for
the universal tree, and consider the effect of lateral gene transfer. Also
being questioned are the methods used to create phylogenies, the accuracy of the
placement of branches deep in the tree, and the lack of use of indel (insertion/deletion)
information to test phylogenetic relationships. The next few paragraphs detail
these ideas.

Recently,  two  research  groups  set  out  to  study  the  diversity  of  thermophiles  within 
different  environments.  Sekiguchi  et  al.  (1998)  used  prokaryote-specific  rRNA  PCR 
primers  to  amplify  sequences  from  methanogenic  granular  sludge.  This  culture- 
independent  approach,  so  termed  because  the  organism  is  not  required  to  be  grown  in  a 
laboratory  culture,  revealed  interesting  results.  Of  the  110  distinct  thermophilic  clones, 
only  22%  were  archaea  while  the  remaining  78%  were  bacteria.  Phylogenetic  analysis 
placed  some  of  these  sequences  branching  as  high  up  the  bacterial  lineage  as  the  T. 
thermodesulfovibrio  and  green  non-sulfur  divisions.  Hugenholtz  et  al.  (1998a)  performed 
a  similar  test  but  used  bacteria-specific  rRNA  primers  in  the  Obsidian  Pool  at 
Yellowstone  National  Park  (a  hot  spring).  Phylogenetic  analysis  determined  that  these 
thermophilic  species  branched  as  high  up  the  tree  as  proteobacteria.  Also,  Hugenholtz  et 
al.  (1998b)  concluded  that  several  of  these  sequences  constitute  new  divisions  within  the 
bacterial tree. The conclusion that the LCA was thermophilic, based on previous
phylogenetic analyses that placed only thermophiles at the base of the bacterial lineage, may
no longer be strongly supported (although some of these results remain compatible with
ancestral thermophily).

Hasegawa  and  Hashimoto  (1993)  concluded  that  phylogenetic  analyses  based  on 
rRNA  genes  could  be  unreliable  due  to  extreme  AT  or  GC  nucleotide  bias  in  the  rRNA 
genes of some taxa. In light of the criticism of using rRNA as a phylogenetic marker,
many researchers are turning to protein-coding genes for analysis of the universal tree.
Examples include the research groups of H. Klenk and W. Zillig. These researchers
(Klenk and Zillig, 1994; Klenk et al., 1999) used RNA polymerase sequence data to study
the  phylogenetic  relationships  of  bacterial  divisions.  In  these  studies,  it  was  found  that 
the  aquifex  division  grouped  with  the  proteobacteria  at  the  top  of  the  tree,  whereas 
Burggraf  et  al.  (1992)  showed  that  the  branch  leading  to  A.  pyrophilus  was  placed  at  the 
base  of  the  bacterial  lineage.  Also,  the  Klenk  group  concluded  that  mesophilic 
mycoplasmas  were  the  first  species  to  diverge  from  the  base  of  the  bacterial  tree,  thus 
suggesting  thermophiles  are  not  at  the  base  of  the  bacterial  lineage. 

It  is  becoming  increasingly  apparent  that  many  genes  within  eukaryotes  and 
prokaryotes have been acquired by horizontal transfer (Jain et al., 1999). Specifically,
extensive horizontal transfer has taken place for operational genes (those involved in
housekeeping), whereas horizontal transfer rarely takes place for informational genes
(those involved in translation, transcription and other related processes) (Rivera et al.,
1998). Benner et al. (1989) suggested that the machinery for transcription, translation,
and replication was present in the protogenome (the most recent common ancestor of
modern life forms). Since translation is very complex and its key components tend to be
universally conserved among the three domains of life, Woese (1998) argued that this
complex was probably the first to be refined. Although RNA polymerases within the
three domains of life share common components, there are many components that are not
universal. It appears, then, that transcription was refined after translation, yet it is fairly
immune to lateral transfer (Jain et al., 1999). Only later did genome replication become
refined. Since translation, transcription, and replication are carried out by large, tightly
integrated complexes, it is not surprising that there is little evidence that their
components were horizontally transferred.

The  problem  of  long-branch  attraction  on  a  phylogeny  has  been  known  for  over 
twenty  years  (Felsenstein,  1978).  This  phenomenon  results  in  the  robust  grouping  of  long 
branches  on  a  tree,  regardless  of  the  underlying  phylogeny.  This  occurs,  for  example, 
when  an  outgroup  is  distantly  related  to  a  set  of  ingroups.  If  any  member  of  the  ingroup 
evolves  considerably  faster  than  the  other  ingroup  members,  it  will  be  placed  too  deeply 
in the tree. This ingroup member is said to be "attracted" to the outgroup (it appears
more similar to the outgroup than it really is).
Philippe and Laurent (1998) give various examples of the above situation. In particular,
they show how three well-studied thermophilic divisions from the bacterial domain (the
thermus, thermotoga, and aquifex divisions) may be exhibiting long-branch attraction. Other
studies suggest that these three divisions may be more closely related to other bacterial
divisions; thermus may group with cyanobacteria (Gupta and Johari, 1998), aquifex may
group with proteobacteria (Klenk et al., 1999), and thermotoga may group with gram+
bacteria (Gupta, 1998).

The possible monophyletic grouping of aquifex and proteobacteria was
discussed above. The other two possible groupings (thermus
with cyanobacteria, and thermotoga with gram+ bacteria) are derived from the analysis of
insertion and deletion data (indels) to infer phylogenetic topologies. Here, Gupta and
colleagues (1997; 1998) used insertions and deletions in Heat Shock Protein 70 (HSP70) bacterial
sequences to elucidate groupings. They also used the neighbor-joining and parsimony
phylogenetic methods to determine groupings. All three methods suggested that a close
relationship exists between the thermus division and cyanobacteria, thus placing thermus
higher up the bacterial lineage. Also, based solely on indels, they have demonstrated the
possible close relationship between thermotoga and gram+ bacteria using the two proteins
HSP70 and glutamate-1-semialdehyde 2,1-aminomutase. Although using indels is not a
statistically rigorous method, it is suggestive and should cause us to re-evaluate some
phylogenetic relationships.

In  review,  the  subject  of  thermophilicity  in  the  last  common  ancestor  is  greatly 
debated  and  clearly  unresolved.  The  last  few  years  have  seen  a  resurgence  of  interest  in 
this  issue  due  in  part  to  new  assays,  methods  and  sequences.  Hopefully  novel  approaches 
will be able to bring some clarity to this "hot" topic.

Specific Aims, Methods and Results

We  considered  reconstructing  an  ancestral  protein  from  an  organism  near  the  base  of 
the  bacterial  tree  to  measure  its  thermal  stability  as  a  way  of  shedding  light  on  the  issue  of 
ancient  thermophily.  Any  attempt  to  reconstruct  ancestral  sequences  requires  the  use  of 
extant sequences as the basis for reconstructions. Completely sequenced genomes were
used to identify the best candidate extant sequences. A database was
generated that contained the genomes of all twelve organisms that had been completely
sequenced as of 09-04-98. The database was then divided into homologous (paralogous and
orthologous) translated gene families based on pairwise comparisons of all genes. We
identified all families of proteins that had representatives in all three domains. For each
family, a neighbor-joining (N-J) gene tree was generated using PAM (point accepted
mutations per 100 amino acids) distances. A table was subsequently generated that
contained a single score for each family. This score represented the distance (PAM value)
between the two most distantly related sequences within the gene tree for a given family. For
example, the lactate dehydrogenase (LDH) family had a score of 165. This means that
the two most distantly related LDHs are separated by an estimated 165 mutations per one
hundred positions based on the tree for this family. However, these scores must be
interpreted with care. Gene size can affect PAM scores because small
genes can be under more functional constraint than larger genes. A
large gene could therefore have a higher PAM score than a small gene even though the small
gene has more mutations in its functional domain, if the mutations in the large gene
congregate within loop and turn regions (nonfunctional domains). Also, a family can
contain many paralogous genes that may be under different selective forces, thus resulting
in a higher overall PAM score for the family due to more mutations (sequence
differences) among the loci.
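
The family score described above (the largest tree distance between any two sequences in a gene tree) can be illustrated with a minimal sketch. It assumes the gene tree is available as a list of edges with branch lengths in PAM units; the function names and the toy tree are hypothetical and are not taken from the software or data used in this study.

```python
from collections import defaultdict


def path_lengths_from(start, adjacency):
    """Return the summed branch length from `start` to every node in the tree."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return dist


def family_score(edges, leaves):
    """Largest leaf-to-leaf path length (in PAM units) in one gene tree."""
    adjacency = defaultdict(list)
    for a, b, length in edges:
        adjacency[a].append((b, length))
        adjacency[b].append((a, length))
    best = 0.0
    for leaf in leaves:
        dist = path_lengths_from(leaf, adjacency)
        best = max(best, max(dist[other] for other in leaves))
    return best


# Toy gene tree with hypothetical branch lengths (not data from this study).
toy_edges = [
    ("seqA", "n1", 20.0),
    ("seqB", "n1", 35.0),
    ("n1", "n2", 10.0),
    ("seqC", "n2", 50.0),
    ("seqD", "n2", 15.0),
]
toy_leaves = ["seqA", "seqB", "seqC", "seqD"]
print(family_score(toy_edges, toy_leaves))
```

In this toy tree the most distant pair is seqB and seqC (35 + 10 + 50 PAM units), so that path sets the score for the family.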

With  this  in  mind,  the  table  containing  scores  for  each  gene  family  was  analyzed.  The 
family with the lowest, and thus most interesting, score (61) was a hypothetical
ethylene-responsive protein. However, since we are interested in testing the ancestral
characteristics of an extant protein, this family is of little interest because its
function is unknown. The second best score (100) was from the adenylylsulfate
3-phosphotransferase family. This family contained only 6 representatives and therefore
also proved to be of little interest. The family that
offered the most potential for ancestral reconstruction was EF-Tu. This family had
the thirty-fifth best score (133). It contained members from all 12 of the complete
genomes.  Swiss-Prot  had  over  80  EF-Tu  sequences  in  its  database,  which  would  enable 
us  to  generate  a  phylogenetic  tree  that  contains  many  branches.  Also,  EF-Tu  has  been 
used  for  many  phylogenetic,  structural,  and  biochemical  studies. 

Elongation Factor Tu (bacteria)/Elongation Factor 1 alpha (archaea and eukaryota) is
a GTPase family member involved in translation. EF-Tu forms a complex with
GTP that in turn favors the binding of an aminoacyl-tRNA (Figure 1-3). This ternary
complex binds to mRNA-programmed ribosomes, delivering aminoacyl-tRNA to the
ribosomal A site (for review see Czworkowski and Moore, 1996). The correct
codon-anticodon interaction alters the conformation of both the aminoacyl-tRNA and EF-Tu, by
way of GTP hydrolysis. The EF-Tu/GDP complex then dissociates due to its subsequent
low affinity for aminoacyl-tRNA and the ribosome. Sequences have been determined for
many species and a robust tree can be constructed without large taxon gaps in it. The
biochemistry of EF-Tu has been studied for over three decades, resulting in a clear
understanding of the functional aspects of the protein (Negrutskii and El'skaya, 1998).
Since elongation factors are involved in translation, their use presumably avoids problems
associated with lateral gene transfer. EF-Tu proteins from thermophiles are thermostable
whereas their mesophilic counterparts are not.

[Figure 1-3: diagram with labels EF-Ts, GTP, nascent polypeptide, aminoacyl-tRNA, peptidyl-tRNA.]

Figure 1-3. Life cycle of elongation factor. Adapted from Voet and Voet (1995).


Reconstructing  a  protein  at  the  base  of  the  universal  tree  requires  many  assumptions. 
To reduce the number of these assumptions, we attempted to reconstruct the common
ancestor of the bacterial lineage. There are two main reasons for doing this. First,
reconstructing sequences at the base of any tree without an outgroup is not possible.
Since methods to root the universal tree are uncertain at best, using a single domain of
the tree permits the other domains to act as outgroups. Second, the bacterial domain is
believed to be the oldest of the three domains. The use of the bacterial lineage would
enable us to have more confidence (less variability) in the reconstruction, yet go far back
in time. A total of 51 complete bacterial EF-Tu sequences were retrieved from GenBank
and Swiss-Prot (Figure 1-4). The sequences were aligned by Darwin and ClustalW. The
two programs generated highly similar alignments. Some final adjustments to the
sequences, which included minor cropping and indel rearrangements, were done by hand.
The final alignment had 51 sequences with 409 positions.

The  next  step  in  the  analysis  was  to  generate  a  tree  topology  using  the  aligned 
sequences.  This  can  be  done  using  any  one,  or  combination,  of  a  number  of  methods. 
The  most  commonly  used  methods  are  the  parsimony,  distance  and  maximum  likelihood 
approaches.  Although  the  maximum  likelihood  method  is  regarded  as  the  most 
statistically  grounded  approach  (see  below),  it  would  take  too  much  computational  time 
to resolve a topology for 51 sequences. One way around this problem is to narrow the
"tree  space"  that  maximum  likelihood  must  search  to  find  the  best  topology.  This  is 
accomplished  by  giving  maximum  likelihood  a  constrained  tree.  A  constrained  tree 
forces  some  taxa  to  group  together  whereas  other  taxa  are  free  to  form  various  groupings 
or  relationships.  The  parsimony,  distance  and  "approximate"  maximum  likelihood 
methods  were  used  to  generate  a  constrained  tree. 
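
To give a sense of why an unconstrained search is impractical, the size of the tree space can be computed directly. The sketch below uses the standard count of unrooted, strictly bifurcating topologies for n labeled taxa, (2n - 5)!!; the function name is hypothetical, and the calculation is an illustration rather than part of the analysis pipeline described here.

```python
def count_unrooted_topologies(n_taxa):
    """Number of unrooted, strictly bifurcating trees for n labeled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_topologies(n))

# For the 51-sequence data set, report only how many digits the count has.
print(len(str(count_unrooted_topologies(51))))
```

Every grouping fixed in a constrained tree removes a large share of these topologies from consideration, which is what makes the subsequent maximum likelihood search tractable.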

The  principle  of  parsimony  is  quite  simple.  This  method  searches  for  a  tree  that 
requires  the  smallest  number  of  evolutionary  changes  to  explain  the  differences  observed 
among  the  taxa.  Parsimony  relies  on  what  are  known  as  informative  sites.  A  site  is 
considered  phylogenetically  informative  only  if  it  favors  some  trees  over  other  trees. 
Given a topology, one can count the minimum number of replacements (between extant
and ancestral nodes) to generate the tree value. Therefore, the most parsimonious tree is
the topology, or set of topologies, that requires the fewest substitutions (Fitch,
1971). Unfortunately, computer simulations have shown that this method is less reliable
when there is variation in branch lengths and when multiple substitutions have taken
place at a given site (Tateno et al., 1994). This computationally quick method works best
for determining simple evolutionary relationships. PAUP 4.0 beta version was used
to find such relationships among the taxa (Swofford, 1998). Bootstrap analysis
(Felsenstein, 1985) using parsimony revealed many statistically relevant relationships
within the data set. Although this method generated many relationships among the taxa
(data not shown), it did not establish enough relationships to generate a constrained tree
suitable for a maximum likelihood analysis.

                                        SWISS-PROT
                                        Primary
OTU       Species Name                  Accession Entry Name

Spirochetes
TPAL      Treponema pallidum            O83217    EFTU_TREPA
BBUR      Borrelia burgdorferi          P50062    EFTU_BORBU
TREHY     Treponema hyodysenteriae      P52854    EFTU_TREHY
Thermotoga
THEMA     Thermotoga maritima           P13537    EFTU_THEMA
FERIS     Fervidobacterium islandicum   O50340
Deinococci/Thermus
THETH     Thermus aq. (thermophilus)    P07157    EFTU_THETH
THEAQ     Thermus aquaticus             Q01698
DEISP     Deinonema sp                  P33168
Bacteroides
BACFR     Bacteroides fragilis          P33165
CYTLY     Cytophaga lytica              P42474
TAXOC     Taxeobacter ocellatus         P42480    EFTU_TAXOC
FLAFE     Flavobacterium ferrugineum    P42476    EFTU_FLAFE
CHLVI     Chlorobium vibrioforme        P42473    EFTU_CHLVI
Gram Positive (Low G+C)
BACSU     Bacillus subtilis             P33166
BACST     Bacillus stearothermophilus   O50306    EFTU_BACST
MYCGA     Mycoplasma gallisepticum      P18906    EFTU_MYCGA
MGEN      Mycoplasma genitalium         P13927    EFTU_MYCGE
MPNED     Mycoplasma pneumoniae         P23568    EFTU_MYCPN
UREUR     Ureaplasma urealyticum        P50068    EFTU_UREUR
MYCHO     Mycoplasma hominis            P22679    EFTU_MYCHO
STROR     Streptococcus oralis          P33170    EFTU_STROR
HPYL      Helicobacter pylori           P56003    EFTU_HELPY
Cyanobacteria
ANANI     Anacystis nidulans            P18668    EFTU_ANANI
SYNP7     Synechococcus sp.             P33171    EFTU_SYNP7
SYNP3     Synechocystis sp.             P74227    EFTU_SYNY3
SPIPL     Spirulina platensis           P13552    EFTU_SPIPL
Proteobacteria (purple)
AGRTU     Agrobacterium tumefaciens     P75022    EFTU_AGRTU
STIAU     Stigmatella aurantiaca        P42479    EFTU_STIAU
RICPR     Rickettsia prowazekii         P48865
BDRCE     Burkholderia cepacia          P33167
THICU     Thiobacillus cuprinus         P42481    EFTU_THICU
NEIGO     Neisseria gonorrhoeae         P48864    EFTU_NEIGO
SALTY     Salmonella typhimurium        P21694    EFTU_SALTY
ECOLI1    Escherichia coli              P02990    EFTU_ECOLI
ECOLI2    Escherichia coli              P02990
HINF      Haemophilus influenzae        P43926    EFTU_HAEIN
SHEPU     Shewanella putrefaciens       P33169
WOLSU     Wolinella succinogenes        P42482
CAMJE     Campylobacter jejuni          O69303
Gram Positive (High G+C)
MICLU     Micrococcus luteus            P09953
BRELN     Brevibacterium linens         P42471
CORGL     Corynebacterium glutamicum    P42439
PLARO_A   Planobispora rosea            X98830
PLARO_B   Planobispora rosea            U67308
STRCJ     Streptomyces cinnamoneus      P95724
STRAU     Streptomyces aureofaciens     O33594
MTUB      Mycobacterium tuberculosis    P31501
MYCLE     Mycobacterium leprae          P30768
Aquifex
AQUPY     Aquifex pyrophilus            O50293
AQUAE1    Aquifex aeolicus              O66429
AQUAE2    Aquifex aeolicus              O66429
Eukaryota
GLUPL     Glugea plecoglossi            D32139
Archaebacteria
METVA     Methanococcus vannielii       P07810
SULAC     Sulfolobus acidocaldarius     P17196

Figure 1-4. List of taxa names used in the elongation factor analyses.
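
The parsimony criterion described above (counting the minimum number of replacements a topology requires) can be sketched with the Fitch (1971) set-intersection rule. This is a minimal illustration under stated assumptions, not the PAUP implementation: the tree is written as nested tuples of taxon names, and the four short sequences are hypothetical.

```python
def fitch_site_cost(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    `tree` is either a taxon name (leaf) or a pair of subtrees (internal node).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site_cost(left, leaf_states)
    right_set, right_cost = fitch_site_cost(right, leaf_states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state, so no extra change is needed here.
        return common, left_cost + right_cost
    # No shared state: one replacement must have occurred at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the minimum number of changes over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site_cost(tree, column)[1]
    return total


# Hypothetical four-taxon alignment; only the second column is informative.
toy_alignment = {"A": "AKL", "B": "AKV", "C": "ARV", "D": "GRV"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_ab_cd, toy_alignment))
print(parsimony_length(tree_ac_bd, toy_alignment))
```

The first and third columns cost one change on either topology, so they do not discriminate between the trees; the second column favors the grouping of A with B, which is what makes it phylogenetically informative.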

The approximate maximum likelihood method (Adachi and Hasegawa, 1996) was
used to determine more relationships among the bacterial taxa. This method first
generates a Neighbor-Joining (N-J) distance tree and then uses the maximum likelihood
approach to search the tree space around the N-J tree (see below for explanation on both
distance and maximum likelihood methods). This analysis works on the assumption that
the N-J tree tends to be very similar to the maximum likelihood tree. Unfortunately, we
are not guaranteed to find the best tree using this method. It is, however, a good method for
resolving relationships among sequences that are more distantly related than those resolved
by parsimony. The RELL bootstrap analysis method using approximate maximum
likelihood as implemented in Molphy was used on the EF data set. The analysis
generated many evolutionary relationships (data not shown), in addition to those
generated by parsimony.
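
The idea behind RELL (resampling of estimated log-likelihoods) bootstrapping can be sketched as follows. Instead of re-estimating each tree for every bootstrap replicate, the per-site log-likelihoods already computed for each candidate topology are resampled with replacement and summed. The sketch assumes those per-site values are available as plain lists; the function name and the numbers are hypothetical, and this is not the Molphy code.

```python
import random


def rell_bootstrap(site_loglik, n_replicates=1000, seed=0):
    """Proportion of replicates in which each tree has the highest resampled log-likelihood.

    `site_loglik` maps a tree name to its list of per-site log-likelihoods;
    all lists must cover the same alignment columns in the same order.
    """
    rng = random.Random(seed)
    names = list(site_loglik)
    n_sites = len(site_loglik[names[0]])
    wins = {name: 0 for name in names}
    for _ in range(n_replicates):
        # Resample alignment columns with replacement, as in an ordinary bootstrap.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {name: sum(site_loglik[name][i] for i in sample) for name in names}
        best = max(names, key=lambda name: totals[name])
        wins[best] += 1
    return {name: wins[name] / n_replicates for name in names}


# Hypothetical per-site log-likelihoods for two candidate topologies.
toy_loglik = {
    "tree_1": [-2.0, -1.5, -3.0, -2.2],
    "tree_2": [-2.5, -1.9, -3.1, -2.8],
}
support = rell_bootstrap(toy_loglik, n_replicates=200)
print(support)
```

Because no likelihood is re-optimized, each replicate costs only a resampled sum, which is why this approximation is practical for data sets where a full maximum likelihood bootstrap would be too slow.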

For  a  variety  of  applications,  evolutionary  distances  between  sequences  are  needed. 
Distance  measures  are  generally  constructed  so  that  the  distance  between  two  sequences 
scales  linearly  with  the  number  of  mutations  that  occurred  during  the  evolutionary  history 
separating  them.  Simple  tools  for  estimating  these  distances  assume  that  all  sites  in  the 
sequence  change  at  the  same  rate.  In  fact,  a  simple  inspection  of  any  multiple  sequence 
alignment  will  show  that  this  is  not  the  case.  Some  positions  are  more  variable  than 
others.  This  fact,  as  it  turns  out,  makes  distances  calculated  using  these  simple  tools 
inaccurate.  Modeling  this  rate  heterogeneity  among  sites  can  overcome  these  associated 
problems. 

Rate heterogeneity is a concept that dates back to 1967, when Fitch and Margoliash
were analyzing cytochrome c sequence data. They discovered that the number of
substitutions per site did not follow a Poisson distribution. Up to this time, researchers
had proposed that all sites in a sequence change at the same rate (rate homogeneity) and
therefore conform to the Poisson distribution. Further analysis revealed that the number
of substitutions per site tends to follow a negative binomial distribution (Golding, 1983),
which is what is expected when the rates at individual sites are themselves gamma
distributed. This gamma model of rate heterogeneity more accurately describes rate
variation.

To better understand rate variation among sites, consider two sets of sequences, each
containing two DNA sequences. In the first set, every site is free to mutate. In the
second set, only 90% of the sites can mutate; the remaining 10% are invariant, so the two
sequences always share at least 10% identity. After infinite time, the first set will show
25% sequence identity due to random chance with four nucleotides. After infinite time and
with the same mutation rate as the first set, the second set will show 32.5% sequence
identity (0.25 x 0.90 + 0.10). Estimates of evolutionary distances can be biased, and
therefore flawed, unless this rate heterogeneity is taken into account. In reality, unlike
this simple example, substitution rates are estimated from sequence similarity data. It is
not possible to generate a true tree topology if the number of substitutions between
sequences is incorrectly calculated.

Yang (1996) gives a nice review of the gamma distribution and its main parameter alpha.
A large alpha represents little rate variation (some pseudogenes) and a small alpha value
represents extreme rate variation (globin genes). Sullivan et al. (1996) showed that the
estimate of alpha based on a well-corroborated tree topology is well outside the
distribution of estimates derived from random trees. However, Yang et al. (1995a)
demonstrated that it is possible to accurately estimate alpha as long as the given topology
is a rough approximation of the true tree. A well-analyzed topology based on RNA
sequences (Pace, 1997) (Figure 1-2) was used to estimate the EF-Tu alpha value using
Yang's PAML program. An alpha value of 0.47 was obtained. Alpha was also estimated
under two other conditions. First, some of the branches were shuffled around in the Pace
topology. Second, a topology different from Pace's (Baldauf et al., 1996), containing only
13 of the 51 sequences, was applied to the data set. Both alternatives generated alpha
estimates between 0.48 and 0.50.
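
The two-set thought experiment above can be worked through numerically. The following is a minimal sketch, assuming the Jukes-Cantor model for nucleotides (the text does not name a specific model) and hypothetical function names. It shows that a distance estimator which treats all sites as variable underestimates the number of substitutions when a fraction of sites is in fact invariant.

```python
import math


def expected_identity(fraction_variable, n_states=4):
    """Identity after infinite time when only a fraction of sites can change."""
    # Variable sites match by chance 1/n_states of the time;
    # invariant sites always match.
    return fraction_variable / n_states + (1.0 - fraction_variable)


def jukes_cantor_distance(identity, fraction_variable=1.0):
    """Substitutions per site under Jukes-Cantor (an assumed model).

    With fraction_variable < 1 the observed differences are attributed
    only to the sites that are free to vary.
    """
    # Proportion of differences among the variable sites.
    p = (1.0 - identity) / fraction_variable
    if p >= 0.75 - 1e-9:
        return math.inf  # saturated: the distance cannot be estimated
    per_variable_site = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    # Average over all sites; invariant sites contribute no substitutions.
    return fraction_variable * per_variable_site


if __name__ == "__main__":
    print(expected_identity(1.0))   # first set of sequences
    print(expected_identity(0.9))   # second set, 10% invariant sites
    print(jukes_cantor_distance(0.55))                         # rate homogeneity assumed
    print(jukes_cantor_distance(0.55, fraction_variable=0.9))  # invariant sites modeled
```

For two sequences that are 55% identical, the estimate that allows for 10% invariant sites is larger than the estimate made under rate homogeneity, which is the bias described in the text.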

Distance methods can have advantages over parsimony for a number of reasons. For
example, they can more easily account for multiple substitutions at sites, and they can
better resolve a topology with widely varying branch lengths. These properties are
integral to the EF analysis because very ancient divergences can involve many lineages
with substantially different branch lengths. To create a distance tree, a distance matrix
for the taxa is generated and then the taxa are "joined" together. The distance matrix is
generated by pairwise sequence comparisons: each sequence is compared to every other
sequence and the branch length (number of substitutions) is estimated for each pair. The
most commonly used approach is the least-squares method (Rzhetsky and Nei, 1993).
Although this method is not computationally difficult, it is laborious to explain. The
reader is referred to Fitch and Margoliash (1967) for a detailed explanation (their method
is the same as the least-squares method when one uses five or fewer taxa). Such methods
yield the pairwise distances between sequences. The distances have been corrected for
multiple hits and are therefore not the observed distances.

The next step is to form a topology based on the distance matrix values. The most common
method to generate a topology is the Neighbor-Joining method. However, the N-J method is
only an approximation to the Minimum Evolution (ME) method. For all alternative trees it
is possible to estimate the length of each branch from the estimated pairwise distances
and then calculate the sum (S) of all branch-length estimates. The minimum evolution
criterion is to choose the tree with the smallest value of S (Cavalli-Sforza and Edwards,
1967). However, since it is too computationally intense to calculate the S value for all
trees, an N-J tree is initially generated and then the minimum evolution criterion is used
to search the tree space around the N-J tree. Instead of creating a distance matrix using
the least-squares method, we generated a matrix using a maximum likelihood method that
incorporated the alpha value of 0.47 (Kumar et al., 1993). This distance matrix was
subsequently used in PAUP 4.0 beta to generate minimum evolution topologies for all the
bacterial divisions.
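
To illustrate how a shape parameter such as alpha = 0.47 changes a pairwise distance, the sketch below uses the closed-form gamma correction for amino acid sequences, d = alpha[(1 - p)^(-1/alpha) - 1], alongside the Poisson correction d = -ln(1 - p). This is a simpler stand-in for the maximum likelihood distances actually used in the analysis, not a reproduction of them, and the function names are hypothetical. The two ten-residue strings are the first ten aligned positions of the E. coli and Thermus sequences shown in Figure 1-7.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing residues, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def poisson_distance(p):
    """Multiple-hit correction that assumes the same rate at every site."""
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Multiple-hit correction with gamma-distributed rates across sites."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


def distance_matrix(sequences, alpha):
    """Symmetric matrix of gamma-corrected distances for aligned sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = gamma_distance(p_distance(sequences[i], sequences[j]), alpha)
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    p = p_distance("MSKEKFERTK", "MAKGEFIRTK")
    print(p, poisson_distance(p), gamma_distance(p, 0.47))
```

With a small alpha the corrected distance is noticeably larger than the Poisson distance for the same observed difference, and the two converge as alpha grows large.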
The consensus topologies for all nine bacterial lineages, generated by the parsimony,
approximate ML, and ME approaches, were used to create a constrained tree (Figure 1-5).
Thus, the number of taxa has essentially decreased from 51 to nine. In mathematical
terms, the number of possible tree topologies decreased from roughly
3 x 10^[exponent illegible] to about 3.5 x 10^7. The constrained tree was subsequently
used for a full-blown maximum likelihood analysis.
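
The size of tree space can be computed directly. The sketch below uses the standard counts of bifurcating topologies, (2n - 3)!! for rooted and (2n - 5)!! for unrooted trees with n taxa. The text does not state which count was used or how many outgroup taxa were included, so the taxon numbers in the example are assumptions. Note that 17!! = 34,459,425 (about 3.4 x 10^7) is the rooted count for 10 taxa and the unrooted count for 11 taxa.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for an odd k >= 1."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def rooted_topologies(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 3)


def unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (9, 10, 11, 51):
        print(n, rooted_topologies(n), unrooted_topologies(n))
```

For 51 taxa these counts are on the order of 10^78 (rooted) and 10^76 (unrooted), which is why collapsing each bacterial division to a single constrained lineage makes an exhaustive likelihood comparison feasible.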

Maximum likelihood is considered a higher-order statistical analysis because it
incorporates an explicit probabilistic model for substitution processes (Felsenstein, 1981).
Maximum likelihood calculates the transition probability from one residue to another in a
time interval for each branch. This requires a substitution matrix. The most commonly
used substitution matrix for protein data is the mutation data matrix (MDM) published by
Dayhoff (1978). The MDM was calculated from exchange probabilities derived from an
analysis of the evolutionary changes seen in groups of very similar proteins. The ML
method can then use these probabilities to calculate the likelihood at each site,
incorporating various parameters (e.g., gamma distribution or transition/transversion
ratios). The product of the likelihoods for all the sites is then computed and this result is
the likelihood score. For a more thorough explanation see Felsenstein (1988), Hasegawa
et al. (1991), and Li and Gouy (1991).

Figure 1-5. Consensus Tree generated from the analysis of individual bacterial divisions
using the parsimony, distance, and approximate likelihood methods.

Yang's PAML program was used to perform the maximum likelihood analysis
(PAML is the only ML program that can use the gamma distribution for amino acid data).
The estimated log-likelihood score for the EF-Tu data based on the Pace topology without
using alpha was -14967. The estimated score using an alpha of 0.5 was -13685. Since these
two scores are highly significantly different (p << 0.005) according to the log-likelihood
ratio test, it is important that alpha is incorporated into the analysis. Unfortunately,
PAML does not have a good tree-searching algorithm. Therefore a combination of the Molphy
and PAML programs was implemented to analyze the data using the constrained lineages.
First, Molphy searched for tree topologies. This analysis utilized the approximate ML
method incorporating the JTT substitution model (Jones et al., 1992), which is similar to
the Dayhoff matrix. The top 2000 trees were saved and then used in the full-blown ML
method in Molphy. This generated a series of likelihood scores with bootstrap values.
The top 15 trees (according to likelihood scores, bootstrap values and biological
relevance) were submitted for PAML analysis with the alpha value set to 0.47.
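
The likelihood ratio test quoted above can be reproduced from the two reported scores. The sketch below assumes that the model with gamma-distributed rates differs from the equal-rates model by one parameter (alpha), so that twice the difference in log-likelihood is compared with a chi-square distribution with one degree of freedom. That degree-of-freedom choice and the function names are assumptions, not details given in the text.

```python
import math


def likelihood_ratio_statistic(lnl_null, lnl_alternative):
    """Twice the gain in log-likelihood of the richer model."""
    return 2.0 * (lnl_alternative - lnl_null)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    # Scores reported for the Pace topology without and with alpha.
    statistic = likelihood_ratio_statistic(-14967.0, -13685.0)
    print(statistic, chi2_sf_1df(statistic))
```

The statistic is so far beyond the 5% critical value of about 3.84 that the tail probability underflows to zero in double precision, consistent with p << 0.005.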

Unfortunately, RELL bootstrap scores revealed that no single topology fit the data
significantly better than any other topology. However, four of the 15 topologies did make
up the vast majority of the bootstrap scores (~90%). These four topologies were very
similar to the Pace topology (in fact, one was the Pace topology). Thus, we may be able
to accept the Pace topology as representing the true relationships among bacterial
divisions.
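
The RELL bootstrap resamples the per-site log-likelihoods already computed for each candidate topology instead of re-estimating every tree for every replicate. The following is a minimal sketch of that idea with hypothetical names; the site log-likelihoods in the example are made-up placeholders, not values from the EF-Tu analysis.

```python
import random


def rell_bootstrap(site_lnl, n_replicates=1000, seed=0):
    """Fraction of replicates in which each topology has the best score.

    site_lnl maps a topology name to its list of per-site log-likelihoods;
    all lists must have the same length.
    """
    rng = random.Random(seed)
    names = list(site_lnl)
    n_sites = len(site_lnl[names[0]])
    wins = {name: 0 for name in names}
    for _ in range(n_replicates):
        # Draw sites with replacement; the same draw is applied to every tree.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {name: sum(site_lnl[name][i] for i in sample) for name in names}
        best = max(names, key=lambda name: totals[name])
        wins[best] += 1
    return {name: wins[name] / n_replicates for name in names}


if __name__ == "__main__":
    # Placeholder per-site log-likelihoods for two hypothetical topologies.
    example = {
        "tree_1": [-1.0, -2.0, -1.5, -3.0, -2.2],
        "tree_2": [-1.1, -1.9, -1.6, -2.9, -2.3],
    }
    print(rell_bootstrap(example, n_replicates=500, seed=1))
```

When several topologies share the support, as with the four topologies described above, no single tree reaches a high proportion even though the group as a whole does.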

In  1995,  Yang,  Kumar  and  Nei  devised  a  new  model-based  likelihood  method  for 
reconstructing  ancestral  sequences.  This  method  followed  a  standard  statistical  theory: 
that  given  the  data  at  a  site,  the  conditional  probabilities  of  different  reconstructions  can 
be  compared  and  the  reconstruction  having  the  highest  conditional  probability  is  the  best 
estimate (Yang et al., 1995b). This new method was superior to parsimony for two
reasons.  First,  it  used  a  probability-of-substitution  matrix  to  calculate  the  chance  of  all 
various  substitutions  between  residues.  Secondly,  it  incorporated  branch  lengths  into  its 
estimates  of  reconstruction  probabilities.  This  method  was  also  unique  as  a  likelihood 
approach  because  other  likelihood  models  regarded  ancestral  character  states  as  random 
variables  (Felsenstein,  1981;  Goldman,  1990).  The  idea  of  reconstructing  ancestral 
sequences dates back almost thirty-five years (Pauling and Zuckerkandl, 1963).
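
The principle of comparing conditional probabilities of alternative reconstructions can be sketched for the simplest possible case: one ancestor connected directly to its descendants. The code below assumes a Poisson (equal-rates) amino acid model with uniform residue frequencies in place of the empirical matrix used by PAML, and all names are hypothetical. It is meant only to show how branch lengths enter the calculation, which is something parsimony ignores.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def poisson_transition(ancestor, descendant, t, n_states=20):
    """Probability of ancestor -> descendant over branch length t.

    Assumed equal-rates model, scaled to one expected substitution per unit t.
    """
    decay = math.exp(-n_states / (n_states - 1) * t)
    if ancestor == descendant:
        return 1.0 / n_states + (n_states - 1) / n_states * decay
    return 1.0 / n_states - decay / n_states


def ancestral_posteriors(leaves):
    """Posterior probability of each residue at the ancestor of a star tree.

    leaves is a list of (observed_residue, branch_length) pairs.
    """
    weights = {}
    for ancestor in AMINO_ACIDS:
        weight = 1.0 / len(AMINO_ACIDS)  # uniform prior on the ancestral state
        for residue, t in leaves:
            weight *= poisson_transition(ancestor, residue, t)
        weights[ancestor] = weight
    total = sum(weights.values())
    return {residue: w / total for residue, w in weights.items()}


if __name__ == "__main__":
    equal = ancestral_posteriors([("T", 0.1), ("T", 0.1), ("S", 0.1)])
    unequal = ancestral_posteriors([("T", 2.0), ("S", 0.05)])
    print(max(equal, key=equal.get), max(unequal, key=unequal.get))
```

In the second example the two descendants disagree, and the residue on the short branch is favored because little change is expected along it. Parsimony would treat the two states as equally good.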

Since the early part of this decade a number of research groups have used this idea to
study the chemical and physiological properties of ancient proteins (Stackhouse et al.,
1990; Malcolm et al., 1990; Adey et al., 1994; Jermann et al., 1995). Because these
studies all used the parsimony method to infer their ancestral sequences, using likelihood
to infer the ancient sequences is a new approach.

The Pace topology was used with the EF-Tu data to reconstruct ancestral sequences.
Figure 1-6 shows a region of the ancestral sequence at the ancestral node of all bacteria
using two archaea and one eukaryote as outgroups. Using 75% probability as the cutoff,
the ancestral sequence has 36 ambiguous residues out of 392 total sites. Based on crystal
structure data of the E. coli and Thermus aquaticus EF-Tu (Nissen et al., 1995; Polekhina
et al., 1996; Kawashima et al., 1996), about 17 of the 36 ambiguous residues lie in
regions of the protein that form neither helices nor strands (Figure 1-7). Also, the
ancestral sequence contains, with high probability, the residues that have been deemed
necessary for proper EF-Tu function (Harmark et al., 1990; Cool and Parmeggiani, 1991;
Weijland and Parmeggiani, 1993; Cetin et al., 1998). Therefore, this topology gives rise
to a sequence (even if we ignore ambiguities) that could be functionally active if
synthesized in the laboratory.
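
The 75% cutoff can be applied mechanically to the posterior probabilities reported for each site. The sketch below uses a handful of sites transcribed from Figure 1-6 (only residues with non-zero probability are listed) and hypothetical function names. A site is called ambiguous when its most probable residue falls below the cutoff.

```python
def best_residue(posteriors):
    """Return (residue, probability) for the most probable ancestral residue."""
    return max(posteriors.items(), key=lambda item: item[1])


def ambiguous_sites(site_posteriors, cutoff=0.75):
    """Sites whose best reconstruction has a posterior below the cutoff."""
    return [
        site
        for site, posteriors in sorted(site_posteriors.items())
        if best_residue(posteriors)[1] < cutoff
    ]


# Excerpt of the posterior probabilities shown in Figure 1-6.
FIGURE_1_6_EXCERPT = {
    235: {"A": 0.001, "S": 0.029, "T": 0.969},
    245: {"I": 0.792, "V": 0.208},
    249: {"A": 0.005, "R": 0.001, "I": 0.205, "L": 0.006,
          "M": 0.007, "S": 0.001, "T": 0.003, "V": 0.771},
    250: {"I": 0.420, "L": 0.104, "M": 0.008, "V": 0.466},
    251: {"R": 0.589, "Q": 0.001, "K": 0.409},
}

if __name__ == "__main__":
    print(ambiguous_sites(FIGURE_1_6_EXCERPT))
```

Of the five sites in this excerpt, sites 250 and 251 fall below the cutoff; site 249 passes narrowly even though many residues receive some probability.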

The PAML-computed ancestral sequences can be constructed in the laboratory using
Splicing-by-Overlap-Extension PCR (SOE-PCR). Since it is not possible to know
which of the sequences is the true ancestor of the bacterial lineage, all possible
combinations of residues at ambiguous sites in the proteins should be generated. This
requires the construction of about 2^36 (7 x 10^10) sequences. However, this is the case
only when there are two possible residues at each ambiguous site. The ancestral sequence
based on Pace's topology contains a few sites with three possible residues, thus requiring
the construction of a little more than 2^36 sequences. This can be accomplished by
synthesizing primers that can generate all possible ambiguous residues in the sequence.
PCR methods could subsequently mutate a given sequence (whichever extant sequence
most closely resembles the ancestral sequence) to result in all possible combinations of
ancestral sequences (see section on Bacillus below for detailed explanation).
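
The number of sequences needed to cover every combination is the product of the number of candidate residues at each ambiguous site. The sketch below shows the arithmetic; the function name is hypothetical, and the split into 33 two-residue and 3 three-residue sites is an illustrative assumption, since the text says only that "a few" sites have three candidates.

```python
from math import prod


def library_size(residues_per_site):
    """Number of distinct sequences covering all residue combinations."""
    return prod(residues_per_site)


if __name__ == "__main__":
    # 36 ambiguous sites with two candidate residues each.
    print(library_size([2] * 36))
    # Hypothetical split: 33 sites with two candidates, 3 sites with three.
    print(library_size([2] * 33 + [3] * 3))
```

The first value is 2^36 = 68,719,476,736, or about 7 x 10^10; any site with a third candidate residue increases the total further.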

Site  Freq  Data:

229  1  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEIII:  E(1.000)
230  1  DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDQQ:  D(1.000)
231  1  WWWVWVIVVVVVVVTVVVVVVWVVVVWWWWVWVWWTTWWSDE:  V(1.000)
232  1  FFFFFFFFFFFFFFFFFFMFFFFFFFFFFFFFFFFFFYFFFFFFFFFMMFFFYIW:  F(1.000)
233  1  STSSSSTSSSTTTSSSSTTTTSSSSSSSSTTSSTSTSSTSSSSSSSSTTTSTTHYY:  S(1.000)
234  1  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIKTS:  I(1.000)
235  1  STSTTTTSSSTTTTTSKTTTTSTTSSSTATTTTTTTSPSASSSSSSATTTTTSIII:  A(0.001)
        N(0.000)  K(0.000)  S(0.029)  T(0.969)
236  1  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGPTS:  G(1.000)
237  1  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRGGG:  R(1.000)
238  1  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGIW:  G(1.000)
239  1  [data string illegible]:  T(1.000)
240  1  VWVWWVVWVVVWVVVVVVVVVVVVVVWVVVVVVVVWWWVVWVVMTV:  V(1.000)
241  1  VAVAAAVWGVAAVAWWAVVVVVWSAVVAAAVAVVVVVVAVVVVVVVAAVVVV:  I(0.000)
        V(1.000)
242  1  [data string and probabilities illegible]
243  1  GGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGTW:  G(1.000)
244  1  RRRRRRRRRRRRRRRRRRRRRRRRRRRGRRRRRRRRRRRRRRRRRRRRRRRRRGGG:  R(1.000)
245  1  VIVIWWIIVIVIIVIAWVVIIVVVIVIIIIIIIVIVIVVIVWIWVIIIRRR:  I(0.792)
        M(0.000)  V(0.208)
246  1  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEDEEEEEEEEEEEEEEEEEEEEEVVI:  E(1.000)
247  1  RRRTRRRRKRRTRRRRRRRRRRRRSRRRRRRRRRRRRRRRRRRRRRRRRRRRCSEE:  R(1.000)
248  1  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTTS:  G(1.000)
249  1  ISWTQVIVITVVVRKITQVVIVVIIIKKVVTVKRKIKVVWIIIIVEEVKSVGGG:  A(0.005)
        R(0.001)  N(0.000)  C(0.000)  Q(0.000)  E(0.000)  G(0.000)  I(0.205)
        L(0.006)  K(0.000)  M(0.007)  F(0.000)  P(0.000)  S(0.001)  T(0.003)
        V(0.771)
250  1  WLILVLWILAVIIWLLLVIVVIIVVILLVIVIVIILVLLIIIIVLLIVVIAIV:  A(0.000)
        I(0.420)  L(0.104)  M(0.008)  F(0.000)  T(0.000)  V(0.466)
251  1  KKRHKKLKKKNNKKKRKKKQNHKKKKRKKKKRNKRKKEKKRRKKKRKKKNKKSIIL:  R(0.589)
        Q(0.001)  H(0.000)  K(0.409)
252  1  VVPVVVPVVWTVPVVVIVLVVVVVVVVVVVVSVPVVKVVPPVWTVVVWVLKKK:  A(0.005)
        Q(0.000)  I(0.000)  L(0.001)  P(0.985)  S(0.001)  T(0.001)  V(0.006)
253  1  GGGGGGNGGGNGQGGQGNGNNGNNGGGGGNNNGGGGGGNGGGGGGGGGGNGGNPPV:  G(1.000)
254  1  EEDDDDDEDNDDDVEDDSEEEDEEEEDDEEEDEDDDENDDDDQEEDDQQEEEEGGG:  N(0.001)
        D(0.996)  E(0.003)
255  1  ETEEPEEETEDAEEPETEEEEEQQEEETETTEPEEEEEEEEEEEEEEEEEETEMDD:  D(0.002)
        Q(0.000)  E(0.998)
256  1  VIVIWIIIWVVAVIIWWIVVIVWVVVIVVVVIWVVWWVVVIVIIVWV:  A(0.000)
        I(0.007)  V(0.993)
257  1  EEEEEEEEEEDDEEEEEEEEEEDDEEEEEDDEEEEEEEEEEEEEEEEEEESEEEEE:  E(1.000)
258  1  IIIIIIIIIIIIIIIIIIIIIIIIIIILIIIIIIIIIIIIIIIIIIIIIIIIIIKK:  I(1.000)

Figure  1-6.  Example  of  ancestral  sequence  output  generated  by  PAML  for  the  ancestral 
node  of  the  complete  bacterial  lineage.  "Site"  refers  to  the  amino  acid  position  in  the 
protein  sequence.  "Frequency"  refers  to  the  number  of  times  that  the  given  data  string  for 
that  site  appears  in  the  protein  sequence.  "Data"  refers  to  the  residue  of  all  51  sequences 
at  the  given  site.  The  probabilities  for  the  ancestral  residues  are  given  after  the  data 
(residues  given  in  single  letter  amino  acid  code). 

         1                       21                                  51
E.coli   MSKEKFERTK  PHVNVGTIGH  VDHGKTTLTA  AITTVLAKTY  G-GAARAFDQ  IDNAPEEKAR
Thermus  MAKGEFIRTK  PHVNVGTIGH  VDHGKTTLTA  ALTYVAAAEN  PNVEVKDYGD  IDKAPEERAR

         61                      81                                  111
E.coli   GITINTSHVE  YDTPTRHYAH  VDCPGHADYV  KNMITGAAQM  DGAILVVAAT  DGPMPQTREH
Thermus  GITINTAHVE  YETAKRHYSH  VDCPGHADYI  KNMITGAAQM  DGAILVVSAA  DGPMPQTREH

         121                     141                                 171
E.coli   ILLGRQVGVP  YIIVFLNKCD  MVDDEELLEL  VEMEVRELLS  QYDFPGDDTP  IVRGSALKAL
Thermus  ILLARQVGVP  YIVVFMNKVD  MVDDPELLDL  VEMEVRDLLN  QYEFPGDEVP  VIRGSALLAL

         181                     201                                 231
E.coli   E           -GDAEWEAKI  LELAGFLDSY  IPEPERAIDK  PFLLPIEDVF  SISGRGTVVT
Thermus  EEMHKNPKTK  RGENEWVDKI  WELLDAIDEY  IPTPVRDVDK  PFLMPVEDVF  TITGRGTVAT

         241                     261                                 291
E.coli   GRVERGIIKV  GEEVEIVGIK  -ETQKSTCTG  VEMFRKLLDE  GRAGENVGVL  LRGIKREEIE
Thermus  GRIERGKVKV  GDEVEIVGLA  PETRKTVVTG  VEMHRKTLQE  GIAGDNVGLL  LRGVSREEVE

         301                     321                                 351
E.coli   RGQVLAKPGT  IKPHTKFESE  VYILSKDEGG  RHTPFFKGYR  PQFYFRTTDV  TGTIELPEGV
Thermus  RGQVLAKPGS  ITPHTKFEAS  VYILKKEEGG  RHTGFFTGYR  PQFYFRTTDV  TGVVRLPQGV

         361                     381                     401
E.coli   EMVMPGDNIK  MVVTLIHPIA  MDDGLRFAIR  EGGRTVGAGV  VAKVLS
Thermus  EMVMPGDNVT  FTVELIKPVA  LEEGLRFAIR  EGGRTVGAGV  VTKILE

[Annotation rows (conservation marks, ancestral ambiguity marks, secondary structure) are
not reproduced: their column alignment is unrecoverable.]

Figure  1-7.  Alignment  and  secondary  structure  of  E.  coli  and  T.  aquaticus.  Question 
marks  indicate  ambiguities  in  the  bacterial  ancestral  sequence.  Asterisks  indicate 
sequence  identity,  two  dots  indicate  high  sequence  similarity,  and  one  dot  indicates 
moderate  sequence  similarity.  "H"  indicates  helix  and  "E"  indicates  strand. 

The PCR products can be cloned into the expression vector pET24C (Novagen). This
vector provides a His-tag to facilitate purification of the expressed protein on a
nickel-nitrilotriacetic acid column (Woriax et al., 1995). The resulting plasmid could
then be transformed into E. coli strain HB101, allowing expression and subsequent
purification of the proteins. Because the ancestral sequences may be thermophilic, we
wondered whether an exogenous thermophilic protein could be isolated from a mesophilic
host. The answer is yes, and it is yes for EF-Tu specifically. Orsola Tiboni and
colleagues (Tiboni et al., 1989; Sanangelantoni et al., 1996) have studied EF-Tu in
Thermotoga maritima. T. maritima is a bacterial species with an optimal growth
temperature between 80 and 85°C. These researchers isolated T. maritima EF-Tu that was
expressed in E. coli and showed through a series of GDP/GTP assays that the protein was
active. Another research group, Mathias Sprinzl and colleagues, has done the same work
using Thermus thermophilus EF-Tu (Ahmadian et al., 1991; Blank et al., 1995; Nock et al.,
1995). T. thermophilus is a bacterial species with an optimal growth temperature of 75°C.
In addition, the archaeal thermophilic EF-Tu from Sulfolobus solfataricus has been
expressed in and purified from E. coli (Masullo et al., 1997). These results all indicate
that the ancestral sequences can be isolated whether or not they are thermophilic.

Once the PCR products are cloned and the library of ancestral sequences is expressed,
the thermostability of the proteins can be tested. This is accomplished through a series of
assays: binding of GDP and GTP, and testing the intrinsic GTPase activity of EF-Tu
(Fasano et al., 1982; Tiboni et al., 1989; Ahmadian et al., 1991; Masullo et al., 1994).
The nitrocellulose filter method can be used to test EF-Tu's ability to bind nucleotides
(Arai et al., 1972; Masullo et al., 1991). The reaction mixtures can contain a solution of
20 mM Tris/HCl pH 7.8, 10 mM MgCl2, 50 mM KCl, 7 mM 2-mercaptoethanol, 50 µM
[3H]GDP and 2 µM ancestral EF-Tu. The reactions can be incubated at various
temperatures, filtered through a nitrocellulose membrane, washed, and the radioactivity
determined. The same procedure can be used for [3H]GTP. These two sets of
experiments would yield data that can be plotted as a curve (temperature on the x-axis
and % nucleotide bound on the y-axis) that would indicate the optimal nucleotide binding
temperature.

EF-Tu hydrolysis of GTP requires the presence of ribosomes and aa-tRNA (Kaziro,
1978). However, it has been demonstrated that EF-Tu can hydrolyze GTP in the presence
of kirromycin or mono/divalent cations (Fasano et al., 1978; Fasano et al., 1982; Masullo
et al., 1994). Since kirromycin is unstable at high temperatures, it is best to use cations.
Based on the above research, the monovalent cations K+ and Na+ have the greatest effect
on stimulating the intrinsic EF-Tu GTPase activity. For example, the assay mixture can
contain 20 mM Tris/HCl pH 7.8, 1 mM dithiothreitol, 10 mM MgCl2, 3.6 M NaCl, 0.5
µM EF-Tu and 50 µM [γ-32P]GTP. The reactions are allowed to incubate at various
temperatures. The reactions are stopped by the addition of HClO4 and this results in the
formation of phosphododecamolybdate. This complex is then extracted with isopropyl
acetate and an aliquot of the organic phase is dried on filter paper. The radioactivity is
subsequently measured by a scintillation spectrometer (Fasano and Parmeggiani, 1981).
Again, a graph can be generated showing the optimal temperature of GTP hydrolysis for
the ancestral sequences.

The previously mentioned assays enable one to infer the intracellular temperature at
which the tested ancestral EF-Tu sequence functioned. This can in turn be used to make
inferences regarding the temperature of the environment in which the bacterial ancestor
lived. However, the major difficulty with these experiments is not knowing which of the
generated sequences is the true ancestral sequence. This can hopefully be overcome by
calculating two statistical distributions. Consider the following. Assume that we know
the ancestral sequence from which all bacterial divisions (mesophilic and thermophilic)
are derived. We also know that the sequence is thermophilic. We are interested in
knowing the effect mutations have on the thermophilicity of the protein. It is possible to
analyze this problem in two ways. First, we might randomly mutate the ancestral sequence
and see what fraction of the mutants remain thermostable. Second, we might mutate only
those positions in the ancestral sequence that differ from the extant sequences and test
thermophilicity. The latter approach is more desirable because we are mutating positions
within the same evolutionary space/path as the mutations that led from the ancestor to the
extant species. This approach will generate a distribution of temperatures because some
mutations will decrease thermophilicity whereas others may increase it or cause no change.

In this example a distribution was generated where the ancestral sequence was assumed to
be thermophilic. The same set of experiments can be performed with the ancestral
sequence being mesophilic. Again, a distribution of temperatures is generated based on
mutations within a known evolutionary space/path. The distributions for the thermophilic
and mesophilic ancestral sequence examples should differ. The complexity of protein
dynamics may lead one to believe that it is simpler to mutate a thermophile into a
mesophile than vice versa. If so, the mesophilic ancestor's distribution will fall more
closely around the true temperature, whereas the thermophilic ancestor's distribution
will be skewed towards mesophilic temperatures. Thus, one could generate EF-Tu
ancestral sequences at the base of the bacterial lineage, test thermophilicity, create a
distribution, and compare this distribution to the aforementioned distributions to
infer whether the ancestral sequence was mesophilic or thermophilic. The question
then becomes how to generate the two standard distributions against which the
sequences can be compared.

Based  on  the  same  principles  as  the  above  examples,  it  is  possible  to  take  two  closely 
related  sequences,  where  one  is  mesophilic  and  the  other  is  thermophilic,  reconstruct  the 
ancestral  sequence  and  generate  a  distribution.  Two  possible  examples  of  this  lie  within 
the  Bacillus  and  Methanococcus  genera.  Each  of  these  two  genera  contains  two  closely 
related  sequences  with  one  sequence  being  mesophilic  and  the  other  being  thermophilic. 
Fortuitously, the Bacillus ancestor is generally believed to be mesophilic, whereas the
Methanococcus  ancestor  is  generally  believed  to  be  thermophilic.  Therefore  the  two 
types  of  standard  distributions  can  be  generated. 

The  Bacillus  subtilus  (optimal  growth  temperature  35-40°C)  and  Bacillus 
sterothermophilus  (optimal  growth  temperature  60-65°C)  EF-Tu's  have  been  cloned  and 
sequenced  (Ludwig  et  al,  1990;  Krasny  et  al,  1998).  Based  on  Pace's  topology,  the 
probabilistic  ancestral  EF-Tu  sequence  for  these  two  species  has  been  generated  by 
PAML.  Using  75%  probability  as  a  cutoff,  the  ancestral  reconstructed  sequence  contains 
16  ambiguous  positions,  or  roughly  65,000  possible  sequences.  Six  of  the  ambiguous 
sites  are  in  the  same  positions  as  the  ancestral  sequence  generated  at  the  base  of  the 
bacterial  lineage.  The  sixteen  ambiguous  sites  fall  into  eleven  distinct  sites  or  clusters. 
For  example,  two  sites  are  located  at  positions  6  and  8,  one  site  is  located  at  position  73, 
and  three  sites  are  located  at  positions  183,  189,  and  193.  Since  it  is  not  known  which  are 
the  correct  residues  at  the  ambiguous  positions,  one  needs  to  generate  sequences  with  all 
possible  combinations  of  the  residues.  For  a  detailed  explanation  let  us  consider  only  the 
first  six  positions  that  were  just  mentioned.  Figure  1-8  shows  a  schematic  of  how  to 
generate  these  ancestral  sequences.  Four  PCR  'Primers'  are  required  to  make  a  full-length 
product.  The  first,  'Primer  #1',  will  cover  the  amino-terminus  end  and  extend  through 
amino  acid  positions  6  and  8.  When  this  'Primer'  is  synthesized  it  will  contain  variations 
that  account  for  the  ambiguities  at  positions  6  and  8.  It  will  contain  50%  A  and  50%  T 
corresponding  to  the  third  position  of  codon  6  because  GAT  codes  for  Asp  and  GAA 
codes  for  Glu,  which  are  the  two  most  probabilistic  residues  at  the  position  according  to 
PAML,  71.4%  and  28.6%  respectively.  The  first  position  of  codon  8  will  contain  50%  A 
and  50%  T  on  the  corresponding  'Primer'  because  ACC  codes  for  Thr  and  TCC  codes  for 
Ser.  Therefore,  'Primer  #1'  will  actually  contain  a  mixture  of  four  distinct  types  of 
primers  to  satisfy  the  different  combinations  of  residues,  and  each  primer  will  constitute 
25%  of  the  mixture.  'Primer  #2'  will  have  to  be  synthesized  twice.  Position  73  contains  3 
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Figure  1-8.  Schematic  of  the  PCR  reactions  that  would  generate  a  segment  of  the 
Bacillus  ancestral  sequence. 
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residues  that  cannot  be  synthesized  by  single  nucleotide  replacements  without  generating 
intermediate  amino  acids.  The  first  reaction  will  contain  nucleotides  G  and  A,  each  50%, 
at  the  first  codon  position  of  residue  73  followed  by  nucleotides  C  and  G  at  the  second 
and  third  position  of  the  codon.  The  second  primer  reaction  will  contain  nucleotides  A,  A 
and  T  at  the  first,  second,  and  third  positions,  respectively,  corresponding  to  residue  73. 
'Primer  #2'  will  then  consist  of  2-parts  reaction  one  and  1-part  reaction  two,  which  will 
yield  33%  of  each  primer  type  in  'Primer  #2'.  'Primer  #3'  will  be  the  reverse  complement 
of 'Primer  #2'.  'Primer  #4'  works  on  the  same  principle  as  'Primer  #1'.  One  primer 
reaction  will  generate  twelve  distinct  primer  types  because  there  are  three  ambiguous 
sites, with one site containing three ambiguities (2^2 x 3^1 = 12). Before the full-length products
are  generated,  two  PCR  reactions  will  be  run.  The  first  reaction  will  use  'Primers  #1  and 
#3'.  The  second  reaction  will  use  'Primers  #2  and  #4'.  The  final  reaction  will  use  the 
products  of  reactions  one  and  two,  overlapping  sequence  due  to  'Primers  #2  and  #3',  along 
with  'Primers  #1  and  #4'.  Using  this  method  will  allow  one  to  generate  full-length 
products  that  contain  every  possible  combination  of  the  six  ambiguous  sites  (144  different 
sequences).  Although  this  example  does  not  show  all  the  ambiguous  sites  in  the  Bacillus 
ancestral  sequence,  the  same  principle  can  be  applied  to  the  other  sites  until  PCR 
products  have  been  generated  that  contain  all  possible  combinations  of  the  ambiguous 
residues  over  the  entire  sequence.  After  all  possible  sequences  have  been  generated  they 
can  be  cloned  and  expressed.  A  certain  percentage  of  the  proteins  can  be  assayed  as 
previously  described.  Subsequently,  a  distribution  can  be  generated  for  the  temperature 
stability  of  the  Bacillus  ancestral  sequence. 
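
The primer arithmetic above can be checked with a short sketch. It is illustrative only: the codon alternatives for positions 6, 8 and 73 are read directly from the text, whereas the three ambiguous sites covered by 'Primer #4' are not named in the text and are represented here only by their number of alternatives (an assumption made for counting purposes). The variable names are hypothetical.

```python
from itertools import product
from math import prod

# Codon alternatives stated in the text for 'Primer #1' and 'Primer #2'.
primer1_codons = [["GAT", "GAA"],          # codon 6: Asp or Glu
                  ["ACC", "TCC"]]          # codon 8: Thr or Ser
primer2_codons = [["GCG", "ACG", "AAT"]]   # codon 73: two reactions, three codons

# 'Primer #4': only the number of alternatives per site is given in the text.
primer4_site_counts = [2, 2, 3]

# Every distinct primer is one choice of codon at each ambiguous site.
primer1_variants = list(product(*primer1_codons))
primer2_variants = list(product(*primer2_codons))
n_primer4 = prod(primer4_site_counts)

# An equimolar mixture gives each primer type the same share.
primer1_fraction = 1 / len(primer1_variants)

# Full-length products combine one variant from each degenerate primer.
total_products = len(primer1_variants) * len(primer2_variants) * n_primer4

print(len(primer1_variants), len(primer2_variants), n_primer4, total_products)
```

The same multiplication extends to any number of ambiguous sites, which is why the number of variants grows so quickly once the whole sequence is considered.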

The  same  types  of  experiments  can  be  performed  using  the  PAML  generated 
Methanococcus  ancestral  sequences.  M.  jannaschii  (optimal  growth  temperature  85°C) 
and  M.  vannielii  (optimal  growth  temperature  35°C)  have  been  cloned  and  sequenced 
(Bult et al, 1996; Lechner and Bock, 1987). The tree topology of EF-Tu archaeal
sequences has been determined (Baldauf et al, 1996). This topology was used as input
for PAML. PAML generated the ancestral sequence for the last common ancestor of the
Methanococcus species. The sequence contains 18 ambiguous sites (2^18 combinations). Roughly
260,000  distinct  ancestral  sequences  will  be  generated  using  the  above  procedures.  Once 
a  percentage  of  the  sequences  are  cloned,  expressed,  and  analyzed,  another  distribution 
can  be  created  based  on  temperature  stability. 
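
Because only a percentage of the roughly 260,000 variants would be assayed, the selection step can be pictured as drawing a random sample from the combination space. The sketch below is a conceptual illustration, not a laboratory protocol: it assumes two candidate residues at each of the 18 sites (as implied by 2^18), and the sample size of 96 is a hypothetical choice.

```python
import random

N_SITES = 18  # ambiguous sites reported for the Methanococcus ancestor


def variant_from_index(index, n_sites=N_SITES):
    """Decode an integer into one 0/1 residue choice per ambiguous site."""
    return tuple((index >> site) & 1 for site in range(n_sites))


def sample_variants(n_samples, n_sites=N_SITES, seed=0):
    """Draw distinct variants uniformly at random without enumerating all of them."""
    rng = random.Random(seed)
    indices = rng.sample(range(2 ** n_sites), n_samples)
    return [variant_from_index(i, n_sites) for i in indices]


total = 2 ** N_SITES          # size of the full combination space
picked = sample_variants(96)  # hypothetical number of clones to assay
print(total, len(picked))
```

Each tuple records which of the two residues is used at each site, so the assayed subset can later be related back to specific residue combinations.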

The two distributions created by the Bacillus and Methanococcus analyses can serve
as references against which to compare the distribution generated from EF-Tu ancestral
sequences at the base of the bacterial lineage, to determine whether the ancestor of all
bacteria was thermophilic or mesophilic.

Although  the  experiments  in  this  section  were  not  performed,  we  believe  that  they  can 
provide great insight into the question of whether the last common ancestor of all
bacteria  was  a  thermophile  or  a  mesophile.  A  major  limiting  factor  to  the  successful 
completion  of  these  experiments  may  be  the  fact  that  a  bacterial  tree  topology  cannot  be 
firmly  established.  In  the  next  section  of  Chapter  1  we  show  how  minor  branch  swapping 
in  a  topology  can  have  major  effects  on  the  ancestral  reconstruction  at  basal  nodes  of  a 
tree.  In  conjunction  with  the  already  large  numbers  of  ambiguous  residues  using  the  Pace 
topology,  trying  to  incorporate  all  of  the  additional  ambiguous  residues  that  alternative 
tree  topologies  generate  would  result  in  an  extremely  large  number  of  variants  to  cover  all 
possible  combinations  of  ambiguous  residues.  However,  we  believe  that  the  foundation 
for these experiments has been firmly established by our initial analyses.

Reconstructing the Divergent Evolution of RNases
Background 

Three  paralogous  lineages  of  ribonucleases  have  emerged  via  gene  duplications 
during the evolution of artiodactyls (camel, cattle, deer, pig, etc.). The pancreatic
ribonuclease,  RNase  A,  is  one  of  the  best  studied  of  all  enzymes  (Blackburn  and  Moore, 
1982).  This  enzyme  hydrolyzes  the  RNA  produced  from  bacteria  in  the  rumen  of  these 
animals.  The  function  of  brain  ribonuclease  is  currently  being  examined  in  the  Benner 
group.  Finally,  seminal  ribonuclease  is  unique  in  that  it  has  been  hypothesized  to  play  a 
role  in  immunosuppressivity  whereby  the  sperm  are  able  to  evade  the  female's  immune 
response  during  copulation  (Soucek  et  al,  1983).  The  seminal  lineage  arose  near  the 
time  of  the  divergence  of  deer  from  other  members  of  the  artiodactyl  order,  ca.  40  million 
years  ago.  Trabesinger-Ruf  (1997)  demonstrated  that  bovine  seminal  RNase  could  bind 
to  spermatozoa,  supporting  the  hypothesis  of  this  protein's  role  in  immune  responses. 
Subsequently,  Raley  (2000)  attempted  to  determine  when,  in  the  evolutionary  history  of 
the  seminal  lineage,  the  protein  evolved  the  ability  to  exert  immunosuppressive  activity. 
Based  on  parsimony  analyses,  ancestral  reconstructions  of  seminal  RNase  were  generated 
and tested in the laboratory (Figure 1-9). Two ancestral nodes at the base of the seminal
tree,  for  alternative  topologies,  and  one  node  at  the  base  of  the  Bovidae  clade  were 
reconstructed.  Molecular  and  paleontological  data  are  unable  to  unambiguously  resolve 
the  relationship  of  the  deer  and  okapi  lineages.  These  lineages  either  arose  independently 
ca.  40  million  years  ago,  or  together  ca.  35  million  years  ago.  The  Bovidae  arose  ca.  5 
million  years  ago.  Table  1-1  shows  the  experimental  in  vitro  behaviors  of  these 
sequences  as  compared  to  extant  pancreatic  and  seminal  sequences. 

Figure 1-9. Tree topology of seminal RNase with pancreatic RNases as outgroups. This
is  in  agreement  with  the  morphological/paleontological  data  (Raley,  2000).  An40 
represents  the  ancestral  sequence  at  the  base  of  the  seminal  tree,  while  An35  represents 
the  ancestral  sequence  at  the  base  of  the  tree  using  the  alternative  topology  of  okapi  and 
deer  being  monophyletic  (see  dashed  line).  An5  represents  the  ancestral  sequence  of 
Bovidae. 

Some behaviors have not changed significantly during this episode of sequence
evolution. This implies that these in vitro behaviors are not relevant to the new
physiological functions of seminal RNase. In contrast, some in vitro behaviors have
changed markedly during this episode of possibly adaptive sequence evolution. The
steady increase in immunosuppressivity "up" the tree culminates in an IC50 value of
5 µg/mL in the extant seminal sequence. Also, the ability to inhibit cell proliferation has
steadily increased, reaching 4.2 µg/mL. Therefore, immunosuppressivity has only recently
arisen through natural selection and can be hypothesized to be the major selective
function of seminal RNase according to these assays.


Table  1-1.  Comparisons  of  the  in  vitro  properties  for  ancestral  seminal  ribonucleases, 
and  the  extant  bovine  seminal  and  pancreatic  ribonucleases.  For  detailed  description  of 
assays  see  Raley  (2000). 

Maximum Likelihood Models for Reconstructing Ancestral RNases

Ancestral sequences reconstructed using higher order models that include branch
lengths and substitution matrices can differ at individual residues from those produced by
simpler parsimony-based analyses. In Figure 1-10, we show how different likelihood
models can reconstruct different residues at some positions. This figure shows how
incorporating different models can lead to varying results and how each model can calculate
a high probability associated with its results, regardless of whether the result is
"correct". Thus, caution must be taken when incorporating models of sequence evolution
into molecular analyses.

It  is  unclear  whether  the  gene  duplication  event  that  gave  rise  to  seminal  RNase 
resulted  from  a  duplication  of  the  pancreatic  or  brain  ribonuclease  gene  (although  some 
phylogenetic  analyses  suggest  pancreatic  as  the  precursor).  Since  the  composition  of 
outgroups can affect reconstructions at the base of the ingroup tree, we tested for these
effects on the seminal tree (Figure 1-10, part (a)). As expected, reconstructions were affected only at
the  base  of  the  tree  (An40  and  An35).  The  reconstructions  at  these  nodes  are  clearly 
influenced  by  the  sequences  leading-to  and  leading-away  from  these  nodes.  In  addition, 
we  tested  the  reconstructions  using  an  amino  acid  model  and  a  codon  model.  The  amino 
acid  model  relies  on  a  replacement  matrix  based  on  observed/expected  values  for  a  large 
number  of  analyzed  replacements  (e.g.,  JTT  and  Dayhoff  matrices).  The  codon  model 
relies  directly  on  the  data  since  the  substitution  matrix  is  formulated  from  the  input  data 
set.  Both  models  have  their  advantages  and  disadvantages.  The  codon  model  works 
from a 64-by-64 matrix (minus the stop codons) but can over-parameterize the analysis,
whereas the amino acid model works from a 20-by-20 matrix but may not be complex
enough to accurately represent the mode of evolution for certain data sets.

Differences  between  the  two  models,  and  between  the  two  outgroups,  are  apparent  for 
all  three  ancestral  nodes.  For  example,  position  22  has  a  serine  predicted  by  the  amino 
acid  model  for  An40  with  the  pancreatic  outgroup,  but  asparagine  is  predicted  as  the  most 
likely  ancestral  residue  by  the  codon  model  for  the  same  node  and  outgroup.  The  effect 
of  outgroup  composition  is  seen  at  position  65  for  An40.  A  glutamine  is  predicted  with  a 
high  degree  of  probability  by  both  the  amino  acid  and  codon  models  using  pancreatic 
RNases  as  the  outgroup,  although  lysine  is  predicted  by  both  models  with  brain  as  the 
outgroup.  A  number  of  examples  can  be  highlighted.  It  is  thus  important  to  understand 
the  effect  of  incorporating  different  models  into  ancestral  reconstruction  analyses. 


Figure  1-10.  Comparing  seminal  RNase  ancestral  reconstructions  between  parsimony 
and  various  models  of  maximum  likelihood.  All  maximum  likelihood  analyses  were 
performed  using  PAML,  under  conditions  as  stated  in  the  text. 

a)  The  extant  amino  acid  RNase  A  (pancreatic)  sequence  is  listed  in  single  letter  code. 
Ancestral  residues  are  listed  when  the  reconstruction  differs  from  the  extant  sequence. 
An40,  An35  and  An5  seminal  reconstructions  are  based  on  parsimony  analyses  with  the 
complete  pancreatic  clade  as  the  outgroup.  Maximum  likelihood  reconstructions  are 
listed  below  the  parsimony  reconstructions  for  the  appropriate  nodes  using  a  subset  of 
pancreatic  or  brain  sequences  for  the  outgroup.  The  posterior  probabilities  using  a  codon 
or  amino  acid  model  are  listed  to  the  left  and  right  of  the  reconstruction,  respectively, 
only  when  the  reconstruction  differs  from  the  parsimony  reconstruction.  When  room  is 
not  available,  the  probability  using  the  codon  model  is  listed  to  the  right  in  parentheses. 
Also,  the  extant  bovine  seminal  residues  are  listed  at  positions  that  differ  from  the  extant 
pancreatic  sequence;  b)  Same  format  as  in  part  (a).  However,  only  the  reconstructions 
corresponding  to  node  An40,  and  using  pancreatic  sequences  as  the  outgroup,  are  listed. 
This  figure  attempts  to  elucidate  the  influence  of  outgroup  sample  size  and  1x4  versus 
3x4  base  frequency  tables,  under  the  codon-based  analysis,  on  ancestral  reconstructions. 

The original parsimony-based analysis using amino acid data, and pancreatic RNases
as  the  outgroup,  identified  two  positions  that  differed  between  nodes  An40  and  An35 
(positions  65  and  113).  Our  maximum  likelihood  analysis  may  have  "resolved"  these 
differences,  as  they  no  longer  exist  under  the  amino  acid  model  with  the  pancreatic 
outgroup.  However,  four  additional  positions  are  now  highlighted  in  our  analyses 
comparing  An40  and  An35  for  amino  acid  data  with  the  pancreatic  outgroup  (positions 
19, 22, 64 and 70).

The  effects  of  incorporating  base  frequency  tables  into  the  codon  model  were  tested 
for  An40  using  the  complete  pancreatic  outgroup  or  a  sample  of  the  sequences  from  the 
outgroup.  Table  1-2  shows  the  unequal  distribution  of  base  frequencies  for  the  first, 
second  and  third  codon  positions  of  seminal  ribonuclease.  A  3x4  base  frequency  table 
incorporates  this  non-uniform  GC  distribution,  whereas  a  1x4  table  only  incorporates  the 
base frequencies treating the codon as a single unit. Positions 22, 64 and 113 are
influenced by the incorporation of these two different parameters (Figure 1-10, part (b)).
Interestingly, only sites 22 and 113 are influenced when using either the complete or
partial  pancreatic  outgroup.  Parameters,  models  and  data  input  can  all  affect  the  outcome 
of  ancestral  reconstructions.  It  is  therefore  necessary  to  incorporate  various  models  into 
the  computational  reconstruction  analysis  before  any  sequences  are  generated  in  the 
laboratory. 

Table 1-2. Base frequencies used in the codon model analysis for ancestral
reconstructions  of  seminal  ribonuclease.  The  1x4  table  only  includes  the  mean  base 
frequencies  for  the  codons.  The  3x4  table  includes  the  base  frequencies  for  all  three 
codon  positions  when  reconstructing  the  ancestral  sequences.  Therefore,  only  by 
incorporating  the  3x4  table  can  the  extreme  GC  bias  of  the  third  codon  position  be 
considered. 


Codon position     T      C      A      G
1                 22%    17%    34%    27%
2                 19%    25%    36%    20%
3                 13%    47%    10%    30%
Mean              18%    30%    27%    26%
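
To make the difference between the two tables concrete, the following sketch derives expected codon frequencies from the values in Table 1-2 in two ways: from the mean base frequencies alone (1x4) and from the position-specific frequencies (3x4). It is a minimal illustration under stated assumptions: each codon frequency is taken as the product of its base frequencies, stop codons of the universal genetic code are excluded, and the remainder is renormalized. The function and variable names are hypothetical and are not taken from PAML.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code

# Base frequencies at codon positions 1, 2 and 3 (Table 1-2).
POSITION_FREQS = [
    {"T": 0.22, "C": 0.17, "A": 0.34, "G": 0.27},
    {"T": 0.19, "C": 0.25, "A": 0.36, "G": 0.20},
    {"T": 0.13, "C": 0.47, "A": 0.10, "G": 0.30},
]


def mean_freqs(position_freqs):
    """Average the base frequencies over the three codon positions (1x4 table)."""
    n = len(position_freqs)
    return {b: sum(p[b] for p in position_freqs) / n for b in BASES}


def codon_freqs(position_freqs):
    """Expected frequencies of the 61 sense codons from per-position base frequencies."""
    raw = {}
    for bases in product(BASES, repeat=3):
        codon = "".join(bases)
        if codon in STOP_CODONS:
            continue
        raw[codon] = (position_freqs[0][bases[0]]
                      * position_freqs[1][bases[1]]
                      * position_freqs[2][bases[2]])
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


mean = mean_freqs(POSITION_FREQS)
f1x4 = codon_freqs([mean, mean, mean])  # same frequencies at every position
f3x4 = codon_freqs(POSITION_FREQS)      # position-specific frequencies

# Preference for C over T at the third position of the Asp codons.
ratio_1x4 = f1x4["GAC"] / f1x4["GAT"]
ratio_3x4 = f3x4["GAC"] / f3x4["GAT"]
print(round(ratio_1x4, 2), round(ratio_3x4, 2))
```

Under the 3x4 table the C-ending codon is favored much more strongly over its T-ending synonym than under the 1x4 table, which is the third-position bias that the mean frequencies cannot represent.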

An  Alternative  View  of  Reconstructing  Ancestral  Sequence  "Space" 

Although  the  methods  of  reconstructing  ancestral  sequences  have  greatly  improved  in 
the  past  few  years,  the  outputs  of  these  methods  have  not  been  comprehensively  studied. 
This  study  is  unique  in  that  it  marks  the  first  time  a  data  set  has  been  analyzed  by 
multiple  models  and  multiple  parameters  with  the  intention  of  actually  reconstructing 
these  ancestral  sequences  in  the  laboratory.  Generating  reconstructed  sequences  has 
always  relied  on  individual  nodes  within  the  tree  topology.  However,  our  analysis 
strongly  suggests  that  it  may  not  be  possible  to  reconstruct  these  points  on  a  tree.  We 
would,  therefore,  like  to  advocate  a  new  approach  for  reconstructing  ancestral  sequences 
along  a  tree.  This  is  conceptually  shown  in  Figure  1-11.  It  is  possible  that  we  may  never 
know  whether  deer  and  okapi  are  monophyletic,  or  whether  the  pancreatic  or  brain 
outgroup  duplicated  to  give  rise  to  the  seminal  lineage.  However,  we  should  not  be 
paralyzed  by  this  lack  of  information.  By  incorporating  historical  and  evolutionary 
ambiguities,  we  have  shown  that  it  is  possible  to  calculate  the  ancestral  sequence 

Figure  1-11.  Representation  of  the  mutation  space  that  ancestral  seminal  RNase 
sequences  have  evolved  through.  It  is  important  to  note  that  this  space  is  not  random 
space,  but  rather  mutation  space  in  which  natural  selection  has  presumably  eliminated 
any  deleterious  mutations.  The  dashed  line  represents  the  alternative  tree  topology  with 
corresponding  nodes  in  parentheses.  Reconstructing  ancestral  sequences  from  this  space 
allows  one  to  identify  all  possible  mutations  regardless  of  the  exact  point  of  the  mutation 
on  the  tree,  and  is  therefore  advantageous  over  other  methods  that  reconstruct  sequences 
at  individual  points  on  the  tree. 

population around an unknown point in evolution. This has clear advantages since the
analysis  does  not  depend  on,  for  example,  what  the  true  outgroup  is.  Also,  we  can 
correlate  rapid  sequence  evolution  with  change  in  function  along  the  tree.  Obviously, 
interpretation  of  the  behavioral  properties  of  the  reconstructed  proteins  will  depend  on 
these  ambiguities.  In  short,  we  believe  this  method  will  prove  valuable  for  understanding 
molecular  evolution  within  a  paleogenomics  framework  and  assist  in  understanding  the 
mechanisms  by  which  divergent  protein  behavior  evolves. 


CHAPTER  2 

DETECTING  FUNCTIONAL  DIVERGENCE  IN  BIOLOGICAL  SEQUENCES: 

HISTOGRAM  APPROACH  USING  ORIGINAL  RATE  DIFFERENCES 


Background 

The  divergent  evolution  of  protein  sequences  from  genomic  databases  can  be 
analyzed  using  different  mathematical  models.  The  most  common  treat  all  sites  in  a 
protein  sequence  as  equally  variable.  More  sophisticated  models  acknowledge  the  fact 
that  purifying  selection  generally  tolerates  variable  amounts  of  amino  acid  replacement  at 
different  positions  in  a  protein  sequence.  In  their  "stationary"  versions,  such  models 
assume  that  the  replacement  rate  at  individual  positions  remains  constant  throughout 
evolutionary  history.  "Non-stationary"  covarion  versions,  however,  allow  the 
replacement  rate  at  a  position  to  vary  in  different  branches  of  the  evolutionary  tree. 
Recently,  statistical  methods  have  been  developed  that  highlight  this  type  of  variation  in 
replacement  rates.  Here,  we  show  how  positions  that  have  variable  rates  of  divergence  in 
different  regions  of  a  tree  ("covarion  behavior"),  coupled  with  analyses  of  experimental 
three-dimensional  structures,  can  provide  experimentally  testable  hypotheses  that  relate 
individual  amino  acid  residues  to  specific  functional  differences  in  those  branches.  We 
illustrate  this  in  the  elongation  factor  family  of  proteins  as  a  paradigm  for  applications  of 
this  type  of  analysis  in  functional  genomics  generally. 

Elongation factors Tu (EF-Tu) and 1α (EF-1α) are homologous proteins essential to
translation in bacteria and eukaryotes, respectively (Krab and Parmeggiani, 1998;
Negrutskii and El'skaya, 1998). These GTPases catalyze the binding of aminoacyl-transfer
RNAs (aa-tRNA) to the A-site of the ribosome. As they are among the slowest
evolving proteins known, EFs are commonly used to study cellular functions (Negrutskii
and El'skaya, 1998; Yang et al, 1990; Duttaroy et al, 1998) and to root the universal tree
of life (Lopez et al, 1999; Baldauf et al, 1996). This sequence stability presumably
reflects enormous functional constraints on the divergent evolution of EFs, highlighting
their central role in translation since the last common ancestor of the three primary
domains of life (Benner et al, 1989). Nevertheless, EF-Tu and EF-1α differ in several of
their specific functions (Krab and Parmeggiani, 1998; Negrutskii and El'skaya, 1998).
For example, bacterial EF-Tu binds GDP ~100-fold tighter than GTP. Eukaryotic EF-1α,
in contrast, binds both with similar affinities. EF-Tu regenerates its active form by
binding to the single-subunit nucleotide exchange factor EF-Ts. EF-1α requires the
multi-subunit nucleotide exchange factor EF-1βγδ. EF-1α also interacts with the
eukaryotic cytoskeleton and may thereby play a role in cellular transformation and
apoptosis (Negrutskii and El'skaya, 1998; Yang et al, 1990). EF-Tu can have no such
role in bacteria.

These  shifts  in  function  must  correspond  at  some  level  to  changes  in  protein 
sequence.  Thus,  functional  changes  can  leave  signatures  in  the  sequences  of  a  protein 
family,  which  can  then  be  detected  with  a  well  constructed  history  of  their  relationships 
and  replacements.  In  many  cases,  it  appears  possible  to  identify  this  record  from  the 
background  noise  of  molecular  evolution.  In  alcohol  dehydrogenase  (Benner  et  al,  1998) 
and  superoxide  dismutase  (Miyamoto  and  Fitch,  1995),  for  example,  previous  studies 
have  shown  that  variable  replacement  rates  at  specific  positions  can  generate  inferences 
relating  changes  in  sequence  structure  to  those  in  function.  These  proteins,  however, 
have  diverged  far  more  rapidly  than  EFs.  Further,  these  studies  have  used  neither  the  full 
power of a mathematical evolutionary (Benner et al, 1998) nor crystallographic
(Miyamoto  and  Fitch,  1995)  analysis.  We  show  here  how  this  combination  is  of  value  in 
functional  genomics,  even  in  proteins  not  generally  regarded  as  good  examples  of 
functional  divergence. 

From  a  mathematical  perspective,  the  most  common  way  to  model  rate  heterogeneity 
among sequence positions is the gamma distribution, with its shape parameter alpha (α)
(Swofford  et  al,  1996;  Yang,  1996).  This  distribution  can  accommodate  a  wide  range  of 
rapidly  and  slowly  evolving  sites.  However,  this  model  assumes  a  stationary  substitution 
process,  whereby  positions  retain  their  same  relative  rates  of  change  throughout 
evolutionary  history.  This  assumption  is  not  expected  to  hold  entirely  true  for  proteins 
that  change  function.  As  an  alternative,  the  covarion  model  proposes  that  the  replacement 
rates  of  amino  acid  positions  can  change  over  time  (Miyamoto  and  Fitch,  1995;  Fitch  and 
Markowitz,  1970;  Tuffley  and  Steel,  1998;  Gu,  1999;  Morozov  et  al,  2000).  Although 
EFs  might  be  expected  to  follow  only  a  gamma  model  given  their  overall  functional 
conservation,  previous  studies  have  instead  suggested  that  a  covarion  process  is  needed  to 
adequately  describe  their  evolution  (Lopez  et  al,  1999;  Lockhart  et  al,  1998;  Moreira  et 
al,  1999).  This  conclusion  is  examined  more  closely  in  this  chapter  and  forms  the  basis 
of  our  integrated  evolutionary  and  structural  biology  analyses  of  functional  divergence 
between EF-Tu and EF-1α.

Methods 

Thirty EF sequences were aligned by DARWIN (Benner, 1998) and then modified
according to the secondary structures of EF-Tu for Escherichia coli (PDB accession
number 1EFC) (Song et al, 1999) and Thermus aquaticus (PDB accession number
1TTT) (Nissen et al, 1995). This approach resulted in a multiple sequence alignment
(MSA) with 380 aligned positions (cf. Moreira et al, 1999). Maximum likelihood (ML)
estimations of α and the replacement rates per site for all 380 aligned positions of EF-Tu
versus EF-1α were accomplished with PAML, v2.0, and its implementation of the Jones,
Taylor, and Thornton model, with rate heterogeneity among sites according to the gamma
distribution (JTT-Γ) (Yang, 1997). The Proportional, Poisson, and Dayhoff models for
protein sequences were rejected as less appropriate for EFs on the basis of their
log-likelihood ratio tests (Huelsenbeck and Rannala, 1997). The phylogeny in these ML
analyses followed that of Baldauf et al. (1996), except for the topological
positions of Chlorobium and Salmonella. As Baldauf et al. did not consider these two
species, their topological positions were based on our follow-up ML analyses with
MOLPHY, v.2.3 (Adachi and Hasegawa, 1997).

Parametric bootstrapping (evolutionary simulations) was conducted with PAML to
calculate the standard deviations (SD) of the α estimates for bacteria alone, eukaryotes
alone, and both groups combined (Huelsenbeck et al, 1995). These simulations (20 per
group) relied on the accepted tree and subtrees of bacteria and eukaryotes, their ML
estimates of branch lengths and α, and the JTT-Γ model. In turn, subsampling
experiments with bacteria alone, eukaryotes alone, and the two groups combined were
completed to test for sample-size effects on their estimations of α (Sullivan et al, 1999).
In these experiments, 20 random subsets apiece were generated for all odd-numbered
subsamples from 5 to 11, 13, and 27 for bacteria, eukaryotes, and both groups,
respectively. The α parameter was then re-estimated for each random subsample using
the same ML conditions as before. In recognition of their greater numbers, the
subsampling trials with both groups combined were stratified such that an extra
eukaryotic sequence was selected relative to bacteria.
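
The stratified subsampling scheme for the combined data set can be sketched as follows. This is a hedged illustration rather than the procedure actually used: the sequence names are placeholders, the split of the thirty sequences into 13 bacterial and 17 eukaryotic entries is an assumption made only so that the example runs, and "an extra eukaryotic sequence" is interpreted as one more eukaryote than bacteria in each odd-sized subsample.

```python
import random


def stratified_subsample(bacteria, eukaryotes, size, rng):
    """Draw an odd-sized subsample containing one more eukaryote than bacteria."""
    if size % 2 == 0:
        raise ValueError("subsample size must be odd")
    n_bacteria = (size - 1) // 2
    n_eukaryotes = n_bacteria + 1
    return rng.sample(bacteria, n_bacteria) + rng.sample(eukaryotes, n_eukaryotes)


def subsampling_plan(bacteria, eukaryotes, max_size, replicates=20, seed=0):
    """Generate `replicates` random subsets for every odd size from 5 to max_size."""
    rng = random.Random(seed)
    return {
        size: [stratified_subsample(bacteria, eukaryotes, size, rng)
               for _ in range(replicates)]
        for size in range(5, max_size + 1, 2)
    }


# Placeholder sequence names; the group sizes are hypothetical.
bacteria = [f"bact_{i}" for i in range(13)]
eukaryotes = [f"euk_{i}" for i in range(17)]

plan = subsampling_plan(bacteria, eukaryotes, max_size=27)
largest = plan[27][0]
print(len(plan), len(largest))
```

Each subset in the plan would then be passed to the ML software to re-estimate α, so that the estimates can be compared across sample sizes.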

Normal  distributions,  sample  kurtosis,  skewness,  and  normality  tests  were  all 
determined  with  SAS/Graph,  rel.  6.03  (SAS,  1988).  Visualization  of  protein  structures 
was  accomplished  with  Chemscape  Chime,  rel.  2.0.3  (www.mdli.com)  and  Protein 
Explorer, rel. 1.46 (www.umass.edu/microbio/chime/explorer).

Covarion  Analyses,  Structural  Biology,  and  Hypothesis  Generation 

The results of the log-likelihood tests are shown in Table 2-1. No significant
differences in log-likelihood scores were observed between the Dayhoff-gamma,
Dayhoff-f-gamma, and JTT-gamma models. The score obtained with the JTT-gamma-rho
model was significantly different from that of the JTT-gamma model. Rho (ρ) models the extent
to which adjacent sites are evolving non-independently (i.e., correlation in evolutionary
rates for neighboring sites) and varies between 0 (adjacent sites are evolving
independently) and 1 (every site's evolutionary rate is correlated with its nearest
neighbor's rate) (Yang, 1995). To test whether incorporating these various models had
any effect on the estimation of replacement rates, we plotted the rates comparing JTT-gamma
versus JTT-gamma-rho (JTT-Γ-ρ). As seen in Figure 2-1, there is little or no
difference in estimated rates for both bacteria and eukaryotes when using either model.
The correlation coefficients were 0.99508 and 0.99579 for bacteria and eukaryotes,
respectively. The near identity of the rate estimates for the JTT-gamma and JTT-gamma-rho
alternatives supports the conclusion that such ML calculations are robust to the chosen
model of sequence evolution (Yang, 1995). However, the significantly better fit of the
observed data to the JTT-gamma-rho model (with its ML estimate of ρ = 0.466) argues
for a relatively strong positive correlation between the individual rates of adjacent sites.
In particular, this potential lack of independence is of special interest to the covarion
sites, because of their evident structural and functional ties (see below).


Table 2-1. Log-likelihood scores for the EF data set using different models of amino acid
evolution. The log-likelihood ratio test (LRT) is a measure of the significance between two
competing nested models. In this case, there is one degree of freedom for the test when
comparing the JTT-gamma and JTT-gamma-rho models {δ = 2[-9287 - (-9307)] = 40}.


Model                   Log-likelihood score
1) Poisson              -10741
2) Proportional         -10504
3) JTT-f                -9779
4) JTT                  -9752
5) Dayhoff-f            -9748
6) Dayhoff              -9744
7) JTT-f-gamma          -9326
8) JTT-gamma            -9307
9) Dayhoff-f-gamma      -9302
10) Dayhoff-gamma       -9299
11) JTT-gamma-rho       -9287
LRT between 8 & 11      P<0.005
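
The likelihood ratio test between models 8 and 11 can be reproduced from the tabulated scores. The sketch below is a minimal check, assuming the usual chi-square approximation with one degree of freedom; for that special case the upper-tail probability can be written with the complementary error function, so no statistics library is needed. The function names are hypothetical.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Scores from Table 2-1: JTT-gamma (null) versus JTT-gamma-rho (alternative).
delta = lrt_statistic(-9307.0, -9287.0)
p_value = chi2_sf_1df(delta)
print(delta, p_value)
```

The resulting probability is far below the 0.005 threshold reported in the table.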

Our ML analyses of EF-Tu and EF-1α revealed a non-stationary α for different
regions of the tree (Figure 2-2). An α of 0.78 was calculated for the entire tree, with a SD
of 0.05 from parametric bootstrapping. In contrast, the α values for both the bacterial and
eukaryotic subtrees were significantly lower [α = 0.46 (0.04) and α = 0.38 (0.04),
respectively]. Thus, a more uniform distribution of rates among sites was suggested
when the two groups were considered together, rather than separately. Gu (1999)
statistically proved that such an increase in α is expected when the variable positions of
one group are not the same as those of another (i.e., when the sequences are evolving
under a non-stationary covarion process). For a schematic representation of this concept

[Figure 2-1 axis label: Eukaryotic relative-replacement rates without the incorporation of rho]

Figure 2-1. The effects of incorporating the rho parameter when estimating bacterial and
eukaryotic relative-replacement rates. The rho parameter signifies whether adjacent sites
(i and i+1) are evolving independently or dependently (correlated).

Figure 2-2. Accepted phylogeny for bacteria and eukaryotes used in the ML analyses of
their EF-Tu and EF-1α sequences. These sequences are from SWISS-PROT, with their
accession numbers given in parentheses next to their species. Brackets refer to the amino
acids of the two groups at position 305, a site illustrating a covarion pattern of sequence
conservation in bacteria but considerable variation in eukaryotes. Branch lengths of this
tree are drawn proportional to their ML estimates, except for the two longest internodes
leading to bacteria and eukaryotes (both 1.30 replacements/site). Total tree length is 7.34
replacements/site (2.54 and 2.37 replacements/site for bacteria and eukaryotes alone,
respectively). Numbers above internal branches represent the ML estimates of α for the
corresponding group or subgroup of bacteria and/or eukaryotes. Standard deviations, as
calculated from twenty rounds of parametric bootstrapping, are given in parentheses for
the α values of bacteria, eukaryotes, and the two groups combined.
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Figure  2-3.  A  schematic  representing  the  differences  between  the  stationary  gamma  and 
non-stationary  covarion  (gamma)  models.  These  two  examples  both  begin  with  an 
identical  ancestral  population  of  sequences  (Al  and  A2,  respectively).  This  ancestral 
population  contains  the  same  site-by-site  rate  variation  in  both  examples  as  represented 
by  the  number  of  circles  in  each  site;  the  first  site  is  slowly  evolving,  sites  2  and  3  are 
moderately  evolving,  while  site  4  is  rapidly  evolving.  In  the  case  of  stationary  gamma 
evolution,  both  descendent  populations  Dl  and  D2  contain  the  same  numbers  of  slowly, 
moderately  and  rapidly  evolving  positions  as  their  ancestor.  Thus,  their  rate  heterogeneity 
of  sites  is  the  same  and  the  a  value  for  each  is  expected  to  be  the  same. 

The  opposite  condition  holds  true  for  the  covarion  example.  Although  the  descendent 
populations  D3  and  D4  contain  the  same  numbers  of  slowly,  moderately  and  rapidly 
evolving  positions  as  their  ancestral  state  A2,  the  identities  of  these  sites  have  changed  in 
D4.  Site  1  is  now  rapidly  evolving  whereas  site  4  is  slowly  evolving.  It  is  this  non- 
stationary  behavior  that  distinguishes  the  covarion  and  gamma  models.  This  type  of 
covarion  behavior  causes  spurious  estimates  of  the  gamma  distribution.  The  estimated 
alpha  values  of  populations  D3  and  D4  are  the  same  as  A2  when  analyzed  individually 
since  each  descendant  has  one  slow,  one  rapid,  and  two  moderate  sites.  However,  when 
D3  and  D4  are  combined,  the  estimated  alpha  value  increases  due  to  an  averaging  effect 
of  the  rate  variation.  Thus,  as  statistically  proven  by  Gu  (99),  an  increased  a  value  for 
combined  data  is  one  sign  of  a  covarion  process. 
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see  Figure  2-3.  The  rate  profile  of  the  replacement  rate  differences  shows  that  the 
individual  site  rates  are  not  uniform  between  the  bacterial  and  eukaryotic  lineages  (Figure 
2-4) 
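The averaging effect described for Figure 2-3 can be illustrated with a small numerical sketch. It is not the ML procedure used in this study: it substitutes a simple method-of-moments estimate (α = mean²/variance) for the ML estimate, the four per-site rates are hypothetical values chosen only to mimic one slow, two moderate, and one rapid site, and the combined data set is approximated by averaging each site's rate across the two groups.

```python
from statistics import mean, pvariance


def alpha_moments(rates):
    """Method-of-moments gamma shape: alpha = mean^2 / variance (an assumption,
    used here in place of the ML estimate)."""
    m = mean(rates)
    return m * m / pvariance(rates)


# Hypothetical rates for four sites: slow, moderate, moderate, rapid.
d3 = [0.1, 0.7, 0.7, 2.5]
# Same rate classes, but sites 1 and 4 have exchanged identities (covarion).
d4 = [2.5, 0.7, 0.7, 0.1]

# Crude stand-in for analyzing both groups together: average each site's rate.
combined = [(a + b) / 2 for a, b in zip(d3, d4)]

alpha_d3 = alpha_moments(d3)
alpha_d4 = alpha_moments(d4)
alpha_combined = alpha_moments(combined)

print(alpha_d3, alpha_d4, alpha_combined)
```

Analyzed separately, the two groups give the same shape estimate because they contain the same set of rates; once the sites are averaged, the slow and rapid sites cancel and the apparent rate distribution becomes more uniform, which raises the estimate.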

The distribution of rate differences per site between bacterial and eukaryotic EFs was
leptokurtic; i.e., relative to the expectations of a normal distribution, sites were
over-represented near the mean and in the tails and under-represented in the "shoulders"
(Figure 2-5). Nearly 50% of the positions had essentially the same rate in the two groups
(rate differences of <0.5 replacements/site/unit evolutionary distance), as expected under
a stationary gamma process. However, 17 sites were evolving >2 SD faster in bacteria than
eukaryotes, while 19 were changing >2 SD faster in eukaryotes than bacteria (Figure 2-5).
These 36 sites, representing about 10% of the MSA, are suggestive of a covarion process in
the EF-Tu/EF-1α family.
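A minimal sketch of this screening step is given below. It assumes that per-site relative rates for the two groups are already available as two equal-length lists; the function names, the use of the population SD, and the toy rates are all hypothetical and are not taken from the software used in this study.

```python
from statistics import mean, pstdev


def tail_sites(bac_rates, euk_rates, n_sd=2.0):
    """Return 1-based positions whose rate difference (bacteria minus
    eukaryotes) lies more than n_sd standard deviations from the mean."""
    diffs = [b - e for b, e in zip(bac_rates, euk_rates)]
    m, sd = mean(diffs), pstdev(diffs)
    faster_bac = [i + 1 for i, d in enumerate(diffs) if d > m + n_sd * sd]
    faster_euk = [i + 1 for i, d in enumerate(diffs) if d < m - n_sd * sd]
    return faster_bac, faster_euk


def excess_kurtosis(values):
    """Excess kurtosis; positive values indicate a leptokurtic distribution."""
    m, sd = mean(values), pstdev(values)
    return sum(((v - m) / sd) ** 4 for v in values) / len(values) - 3.0


# Toy data: 18 sites with equal rates plus one site shifted in each direction.
bac = [1.0] * 18 + [5.0, 0.2]
euk = [1.0] * 18 + [0.2, 5.0]

faster_bac, faster_euk = tail_sites(bac, euk)
kurt = excess_kurtosis([b - e for b, e in zip(bac, euk)])
print(faster_bac, faster_euk, kurt)
```

In the toy data most differences sit at zero and two sites sit far out in opposite tails, so the distribution is sharply peaked with heavy tails, the same qualitative shape reported for the EF data.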
Figure 2-4. Rate profile for the replacement rate differences between the bacterial and
eukaryotic lineages. Peaks at positive and negative values indicate a non-stationary
process for the evolution of rates in EFs.
By integrating structural data with these ML rate differences, this initial pool of 36
sites can be further reduced to a subset of those positions that are most likely involved in
the functional shifts between EF-Tu and EF-1α. For example, ten sites in and around the
region binding tRNAs are evolving >2 SD faster in either bacteria or eukaryotes (Figures
2-5 and 2-6). These rate changes can be correlated to a difference in biochemical
function between EF-Tu and EF-1α. EF-1α/GDP binds charged and uncharged tRNAs,
whereas EF-Tu/GDP does not. Crystallographic data for EF-Tu reveals a major
conformational shift between the GDP- and GTP-bound states, whereby the tRNA-binding
site of the former is disrupted (Figure 2-6). In contrast, available data for EF-1α
suggest that this conformational shift does not occur. This correlation between rate
differences and protein structure/function leads to the hypothesis that at least some of
these ten positions are responsible for the different interactions of EF-Tu and EF-1α with
tRNA. This hypothesis can now be tested by introducing into EF-1α the residues of EF-Tu
at these positions (Golding and Dean, 1998). The prediction is that these
introductions will result in a variant of EF-1α/GDP that does not bind uncharged tRNA.

Similarly, eight sites in and around the region where nucleotide exchange factors bind
are evolving >2 SD faster in eukaryotes than in bacteria (Figures 2-5 and 2-6). EF-Tu
regenerates its active form by binding to the single-subunit nucleotide exchange factor
EF-Ts, whereas EF-1α depends on the multi-subunit EF-1βγδ. The rate differences for
these eight sites lead to the hypothesis that the surface area of EF-1α in contact with its
nucleotide exchange complex is different from that for EF-Tu. This difference is
consistent with the divergent structures of their respective nucleotide exchange factors
(Krab and Parmeggiani, 1998; Negrutskii and El'skaya, 1998).
Figure 2-5. Rate differences per site between bacteria and eukaryotes. Top part of figure
is the histogram of the site-by-site rate differences for the 380 aligned positions of
bacteria minus eukaryotes. Sample kurtosis and skewness measure the "peakedness" and
asymmetry of the histogram relative to the superimposed normal distribution,
respectively. Bottom part of figure represents the amino acid positions in the left and
right tails of the histogram (i.e., those with rate differences of >2 SD between the two
groups). Numbering refers to positions in the MSA. "α", "β", and "L" refer to α-helices,
β-strands, and loops, respectively, following the three-dimensional structure of EF-Tu
(Figure 2-6).
Figure 2-6. MSA for EFs and tertiary structures for EF-Tu. (A) MSA for the ligand-binding
region at the NH2-terminus of three representative bacteria and three eukaryotes
(top and bottom, respectively). This MSA highlights the key residues for aa-tRNA (red),
EF-Ts (green), and nucleotide (yellow) binding and for kirromycin resistance (cyan), as
determined for bacterial EF-Tu. Arrows, above and below the MSA, correspond to those
sites that are evolving >2 SD faster in bacteria than eukaryotes, and vice versa,
respectively (positions 67, 69, 102-103, 117, 123, 131, 133 and 135) (Figure 2-5). (B)
Tertiary structures of the GDP- and GTP-bound states for EF-Tu from E. coli and
T. aquaticus, respectively. Here, green and red in the GTP conformation highlight those
sites that are evolving >2 SD faster in bacteria than eukaryotes, and vice versa,
respectively (Figure 2-5).
Perhaps the most intriguing functional difference between the two EFs is the ability of
EF-1α to bind to actin, the main component of the eukaryotic cytoskeleton. This
function, together with the ability of EF-1α (but not EF-Tu) to bind to uncharged tRNAs,
may be important as a mechanism for tRNA channeling from the ribosome back to the
nucleus (Negrutskii and El'skaya, 1998; Grosshans et al., 2000). Bacteria, of course, do
not require channeling, thereby obviating the need for binding of uncharged tRNAs by
either the GDP- or GTP-states of EF-Tu. Relatively rapid sequence evolution is a general
characteristic of surface residues that are not involved in protein-ligand interactions
(Lichtarge et al., 1996). Nine surface residues to which other contacts cannot be
definitively assigned from biochemical and structural data were evolving >2 SD faster in
EF-Tu than EF-1α (Figures 2-5 and 2-6). These rate differences suggest the hypothesis
that at least some of these residues in EF-1α are in contact with the actin cytoskeleton.

Positions 32-36 are conserved in EF-1α, but variable in EF-Tu (Figures 2-5 and 2-6).
In EF-Tu, biochemical and three-dimensional structural data show that this region is in
proximity to the ribosome (Peter et al., 1990; Ban et al., 1999). In EF-1α, positions 32-36
are followed by an insertion that is suggestive of a binding site with its characteristic
charged amino acids and hydrophobic residues. In combination with its conserved
residues 32-36, this insertion is predicted to introduce a regular secondary structural
element of an α-helix (Benner et al., 1998; Rost and Sander, 1993) that may reflect a
difference in ribosomal structure and binding between bacteria and eukaryotes. Thus,
another testable hypothesis is suggested by the integration of rate differences with protein
structure and function.

How robust are our hypotheses with respect to the current sample of sequences? This
question follows from the recent demonstration by Sullivan et al. (1999) that ML
estimates of rate variation among sites may be sensitive to taxon sampling. In our
subsampling experiments, estimates of α were found to be upwardly biased for the
smaller samples of all three groups (Figure 2-7). Nevertheless, the same major difference
between bacteria and eukaryotes alone versus combined was evident, regardless of the
sample size. Also, α remained largely unchanged (within the range of statistical error)
with the inclusion of 40 and 15 additional sequences from SWISS-PROT for bacteria
(0.48) and eukaryotes (0.35), respectively. Given our initial focus on the fluctuating
estimates of α for bacteria and eukaryotes, our study did not consider Archaea. However,
our more recent investigations of EFs document that this group is defined by an α (0.88)
that is more similar to the combined estimate for bacteria and eukaryotes than to their
separate values. Collectively, these various results argue against sampling error as an
explanation for the non-stationary behavior of α for EF-Tu versus EF-1α.

We also tested the effect of estimating replacement rates when using different alpha
values. Two sets of rates were estimated for bacteria-only using alpha equal to 0.46 and
0.78 (correct and incorrect values, respectively). The same was done for eukaryotes-only
with alpha equal to 0.38 and 0.78. A concatenated data set of bacterial rates as estimated
with alpha=0.46 and eukaryotic rates as estimated with alpha=0.38 was created. Another
concatenated data set was created as above, but included rates as estimated with
alpha=0.78 for both lineages. Figure 2-8 shows the results of this comparison. Although
the correlation coefficient suggests that there is no difference in estimated rates when
using either the correct or incorrect alpha value, a subtle but significant difference can be
seen upon closer inspection. The slope of the linear fit is 0.75843 and the y-intercept is
0.17692. These results demonstrate a clear bias when using different alpha values. The
incorrect alpha value of 0.78 forces the replacement rates to have a more uniform
distribution, and since most of the sites have low replacement rates, the incorrect alpha
does not allow truly rapidly evolving sites to have a high rate. In addition, notice how the
y-coordinate peaks at a value of ~3.4, while the x-coordinate peaks at ~4.2. These two
peaks would be nearly identical if alpha values had no effect on the estimates of
replacement rates. Thus, we have again demonstrated that replacement rates are not
stationary for elongation factors and that rates are intimately correlated to alpha, as
expected.

Figure 2-7. The effect of sequence sample size on the ML estimation of α for bacteria
alone, eukaryotes alone, and the two groups combined. "X's" correspond to the final
estimates of α for each group. Twenty subsampling experiments were completed for each
sample size of a group, with the results summarized as means and their SD.

Figure 2-8. Comparing the bacterial and eukaryotic replacement rate estimates when
using the alpha values of 0.46 and 0.38 from the individual lineages, respectively, versus
the alpha value of 0.78 from the complete tree. The values on the x- and y-axes represent
the relative-replacement rates (replacements/site/unit evolutionary time).
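The way a larger alpha compresses the rate distribution can be illustrated by sampling from gamma distributions that are standardized to a mean rate of 1.0. The sketch below is only an illustration under stated assumptions: it draws random rates with Python's standard library rather than estimating them by ML, and the sample size and seed are arbitrary.

```python
import random
from statistics import mean, pvariance


def sample_rates(alpha, n_sites, seed):
    """Draw relative rates from a gamma distribution with shape alpha and
    scale 1/alpha, so the expected rate is 1.0 and the variance is 1/alpha."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


# alpha = 0.42 approximates the separate groups; 0.78 is the combined-tree value.
rates_low_alpha = sample_rates(0.42, 20000, seed=1)
rates_high_alpha = sample_rates(0.78, 20000, seed=1)

var_low_alpha = pvariance(rates_low_alpha)
var_high_alpha = pvariance(rates_high_alpha)

print(mean(rates_low_alpha), var_low_alpha)
print(mean(rates_high_alpha), var_high_alpha)
```

Both samples have a mean close to 1.0, but the variance is larger for the smaller alpha, which is the sense in which the combined-tree value of 0.78 pulls rapidly evolving sites toward the bulk of slowly evolving ones.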

Covarion  Approaches  and  Functional  Genomics 

Functional  genomics  is  the  bridge  between  computational  and  experimental  biology 
(Bork and Koonin, 1998; Benner et al., 2000). The field combines sequence data with
general knowledge to generate testable hypotheses about the biological functions of genes
and proteins. Today, most hypotheses in the field are generated from sequence similarity
searches with BLAST (Altschul et al., 1997) or FASTA (Lipman and Pearson, 1985).
The  function  of  the  probe  sequence  is  assumed  to  equal  that  of  the  best  annotated  hit  that 
is  recovered  in  these  similarity  searches. 

Functional genomics is actively seeking tools to detect changes in protein function
from their sequences and estimated history (Gu, 1999; Yang et al., 2000). The best-known
approach for this purpose uses the ratio of nonsynonymous to synonymous
substitutions to identify potential cases of functional change (Yang et al., 2000; Li et al.,
1985; Messier and Stewart, 1997). This approach, however, suffers as a signature of
functional change among distant branches, since silent sites quickly lose their signal as
they become saturated with substitutions. Shifts in protein function can also be deduced
from instances of convergent or parallel evolution (Messier and Stewart, 1997). In turn,
functional constraints can be detected as compensatory covariation, whereby different
residues in contact are sequentially replaced in a way that conserves some overall
physical property (Chelvanayagam et al., 1997).

The covarion approach now offers another tool for studying the evolution of protein
function (Gu, 1999). Variability is a feature of a position which reflects its relation to
selected function. Thus, changes across groups in the variability of their sites offer
insights into which positions of a protein may be most responsible for its functional
shifts. If the variability of many positions changes, then the inference can be made that
the protein has acquired a new function (or lost its function). However, this study with
EFs illustrates how much the concept of function is contingent on one's perspective and
how subtle such shifts can be. In detail, EF-Tu and EF-1α function in different ways,
even though their overall role in translation has remained the same. These more subtle,
but nevertheless significant, functional differences involve on the order of 10% of the
sites according to our covarion analysis (Figure 2-5).

Our approach integrates structural data with a covarion-based evolutionary analysis to
improve the identification of those relatively few sites that are largely responsible for the
functional differences between EF-Tu and EF-1α. Together, these two sources of
information allow us to target specific positions and residues for direct
experimental tests of their effects on the function of EFs. Of particular interest are the
surface residues that are evolving >2 SD slower in eukaryotes than in bacteria. If
confirmed by direct testing, the involvement of at least some of these sites in binding
EF-1α to actin would constitute one of the few examples where metabolic channeling, long
an issue in central pathways, has left a signature in the sequences themselves (Reddy and
Pardee, 1980). It is as a tool for hypothesis generation and experimental design that
covarion-based evolutionary studies, coupled with structural biology, will make their
greatest contributions to functional genomics.


CHAPTER  3 
DETECTING  FUNCTIONAL  DIVERGENCE  IN  BIOLOGICAL  SEQUENCES: 
HISTOGRAM  APPROACH  USING  LOG-TRANSFORMED  RATE  DIFFERENCES 


Background 

In  the  previous  chapter  we  described  a  new  approach  for  the  identification  of 
sequence  positions  with  changing  evolutionary  rates.  This  approach  was  based  on  the 
non-stationary  covarion  model,  whereby  a  site  can  be  rapidly  evolving  in  one  group  but 
highly  conserved  in  another  (Fitch  and  Markowitz,  1970;  Miyamoto  and  Fitch,  1995). 
This  covarion-based  approach  relied  on  maximum  likelihood  (ML)  methods  to  estimate 
the  evolutionary  rates  of  sites  in  one  group,  then  in  another.  Those  sites  with  the  greatest 
changes  between  groups  were  identified  from  a  frequency  histogram  of  their  rate 
differences.  As  an  illustration  of  this  approach,  we  provided  a  detailed  ML  analysis  of 
elongation factors for bacteria versus eukaryotes (380 aligned positions for 13 EF-Tu
versus 17 EF-1α sequences, respectively). The analysis identified 17 and 19 sites that
were  evolving  faster  in  bacteria  than  eukaryotes,  and  vice  versa,  respectively.  These  36 
positions  with  the  greatest  rate  differences  were  evaluated  for  their  potential  roles  in  the 
functional  divergence  of  EFs  by  mapping  them  onto  the  known  tertiary  structures  of 
bacterial  EF-Tu. 

The recognition of these 36 sites was based on a comparison of the observed rate
differences for all 380 positions [with their mean and standard deviation (SD) of -0.03
and 1.52 replacements/site/unit evolutionary time, respectively] to their expected normal
distribution. These 36 sites were highlighted as those with rate differences of >2 SD from
the mean. Although used in this way, these cutoffs were not viewed as rigorous
thresholds of statistical significance, but were rather treated as conservative
approximations with heuristic value. One obvious reason for this conservative
interpretation was that the rate differences were not normally distributed, since they were
based on a mixture of both stationary and non-stationary sites.

More importantly, we need to assume that the variances associated with the original
estimates of replacement rates increase with the magnitude of the rates themselves. Such
trends in variance often occur because the error in an estimated value is proportional to
the value rather than constant in absolute terms. Log-transforming the original
replacement rates can help to alleviate these biases.

When  the  underlying  properties  of  a  test  statistic  are  poorly  understood,  computer 
simulations  (parametric  bootstrapping)  offer  one  way  to  generate  the  null  distributions  for 
statistical  testing  (Huelsenbeck  et  al.,  1996).  In  this  way,  null  distributions  are  derived 
from  evolutionary  simulations  for  the  log-transformed  rate  differences  between  bacterial 
EF-Tu  and  eukaryotic  EF-la.  In  these  tests,  the  null  model  consists  of  a  stationary 
gamma  process,  whereby  the  rate  identities  of  sites  remain  constant  over  evolutionary 
time (Miyamoto and Fitch, 1995; Gu, 1999; Morozov et al., 2000). As for the non-stationary
covarion process, rate heterogeneity among sites is accommodated in this
model  by  the  gamma  distribution  (Yang,  1996).  Thus,  the  only  distinction  between  the 
stationary  gamma  and  non-stationary  covarion  (gamma)  approaches  is  that  the  latter 
allows  for  the  evolutionary  rates  of  sites  to  change  across  groups. 
Log-Transformation Statistics

The same EFs, accepted phylogeny, ML approaches, and software used in the
previous chapter were adopted in these simulations. The ML analyses relied on the
Jones, Taylor, and Thornton matrix with rate variation among sites according to the
gamma distribution (JTT-Γ) (Jones et al., 1992; Yang, 1996). Evolutionary simulations
involved two separate series of 100 simulations apiece, with each trial consisting of 30
simulated sequences of length 380 sites. In these simulations, α was set at 0.42 and 0.78,
with the former representing the average of bacteria alone and eukaryotes alone. These
simulations were used to establish statistical thresholds for the observed rate differences.

Figures 3-1 and 3-2 show the relationships between non- and log-transformed rate
differences for the simulated bacterial and eukaryotic data sets, respectively. To
understand the importance of log-transforming the rate differences, consider the
following example for two sites. The first site has a replacement rate of zero in bacteria
and a rate of 2.5 in eukaryotes, whereas the second site has a rate of 1.5 in bacteria and
4.5 in eukaryotes. We can categorize these sites in evolutionary terms: site 1 is invariable
in bacteria but moderately variable in eukaryotes, and site 2 is moderately variable in
bacteria but highly variable in eukaryotes. Using the same statistical cutoffs as we did in
the previous chapter (-3.0, 3.0), we would highlight only site 2 as having undergone
functional divergence, because its rate difference has a magnitude of 3.0 (that of site 1 is
2.5). However, it could be inferred that site 1 has functionally diverged because it has
shifted between being variable and invariable. In fact, site 2 should probably not be
highlighted, because the variances associated with its larger rate estimates are greater
than those associated with the rate of zero in site 1, so that its bacterial and eukaryotic
estimates may overlap.
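The two-site example can be worked through numerically. The sketch below is an illustration only: the text does not state how a rate of exactly zero is handled before taking logarithms, so the small floor value `RATE_FLOOR` is a hypothetical choice introduced here to keep the logarithm finite, and the size of the log-transformed difference for site 1 depends directly on it.

```python
import math

# Hypothetical floor (an assumption, not from the text) so that a rate of
# zero has a finite logarithm.
RATE_FLOOR = 0.01


def log_rate_difference(bac_rate, euk_rate, floor=RATE_FLOOR):
    """Ln(bacterial rate) minus Ln(eukaryotic rate), with rates floored."""
    return math.log(max(bac_rate, floor)) - math.log(max(euk_rate, floor))


# Site 1: invariable in bacteria, moderately variable in eukaryotes.
site1_raw = 0.0 - 2.5
site1_log = log_rate_difference(0.0, 2.5)

# Site 2: moderately variable in bacteria, highly variable in eukaryotes.
site2_raw = 1.5 - 4.5
site2_log = log_rate_difference(1.5, 4.5)

print(site1_raw, site1_log)
print(site2_raw, site2_log)
```

On the raw scale site 2 has the larger difference, whereas on the log scale site 2 reduces to the log of a threefold ratio and site 1, which moves between invariable and variable, becomes the more extreme of the two.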
Figure 3-1. Comparison of simulated bacteria relative-replacement rates to non-transformed (top) and
log-transformed (bottom) relative-replacement rate differences. Five random simulations (out of 100) were
analyzed with alpha equal to 0.42 and branch lengths as determined by the EF ML calculations. Panel
titles: "Bacteria rates versus non-transformed rate differences (Simulated Data)" and "Bacteria rates
versus log-transformed rate differences (Simulated Data)".

Figure 3-2. Comparison of simulated eukaryotic relative-replacement rates to non-transformed (top) and
log-transformed (bottom) relative-replacement rate differences. Five random simulations (out of 100) were
analyzed with alpha equal to 0.42 and branch lengths as determined by the EF ML calculations. Panel
titles: "Eukaryotic rates versus non-transformed rate differences (Simulated Data)" and "Eukaryotic rates
versus log-transformed rate differences (Simulated Data)".

Figure 3-3. Comparison of original bacteria relative-replacement rates to non-transformed (top) and
log-transformed (bottom) relative-replacement rate differences. Panel titles: "Bacteria rates versus
non-transformed rate differences (Original Data)" and "Bacteria rates versus log-transformed rate
differences (Original Data)"; x-axis: bacteria relative-replacement rates.

Figure 3-4. Comparison of original eukaryotic relative-replacement rates to non-transformed (top) and
log-transformed (bottom) relative-replacement rate differences. Panel titles: "Eukaryotic rates versus
non-transformed rate differences (Original Data)" and "Eukaryotic rates versus log-transformed rate
differences (Original Data)"; x-axis: eukaryotic relative-replacement rates.
Figure 3-1 shows how log-transforming the data can alleviate these confounding
problems. We see from the top part of the figure that a cluster of sites at bacterial rates
of ~1.5 have rate differences around -3.0. The corresponding rates in eukaryotes must then
be around 4.5. These sites can be interpreted as being moderately and rapidly evolving in
bacteria and eukaryotes, respectively. Highlighting these sites as having undergone
functional divergence, however, would be premature until we could analyze the
variances associated with the rate estimates. We also notice in the top part of the figure
that there are not many points on the graph that have a bacterial rate of ~0 with a
corresponding rate difference near -3.0. A shift in function should be most noticeable
when a site has gone from invariable to variable, and vice versa. Under the simulation
conditions, most sites that shift between variable and invariable have only a moderate
rate in their variable state. Unfortunately, our previous method
is not able to highlight these types of sites.

The problems associated with simply calculating rate differences can be overcome by
log-transforming the data. The bottom part of Figure 3-1 highlights these adjustments.
The sites that previously had rate differences near -3.0, with a bacterial rate of ~1.5, have
been 'pushed' towards a rate difference of zero. There is also an increase in the magnitude
of the rate differences for sites where the bacterial rate is ~0. Thus, log-transformations
eliminate some of the confounding effects of calculating non-transformed rate differences
and they allow us to analyze the sites in a manner that is consistent with the notion of
sites being 'independent and identically distributed.' These same adjustments take place
for the eukaryotic sites of the simulated data (Figure 3-2) and for the original estimates of
rates for bacterial and eukaryotic elongation factors (Figures 3-3 and 3-4, respectively).
The mean limits of the 100 frequency distributions with α = 0.42 (-3.347, 3.269) were
more conservative (i.e., encompassed a wider range) than those with α = 0.78 (-2.535,
2.475). This relationship is predicted based on Figure 2-8 from the previous chapter.
Higher alpha values represent a more uniform distribution of rates (standardized always
to a mean of 1.0), whereas low values signify rate heterogeneity. Since the majority of
sites in elongation factors have a conserved evolutionary pattern (slow rate), the limited
numbers of rapidly evolving sites are forced to have an evolutionary rate similar to the
conserved sites when alpha equals 0.78. Also, this relationship with α indicated that the
rate differences of sites were not identically distributed, as the rapidly evolving positions
were more heavily concentrated in the tails of their simulated frequency histograms than
were the slowly evolving ones.

Before sites can be identified as evolving differently between bacterial and eukaryotic
elongation factors, statistical thresholds first need to be calculated from the simulations
with alpha equal to 0.42 (Table 3-1). For the cutoffs, the 380 log-transformed rate
differences were analyzed in all 100 trials individually. The means for the 100 trials were
then used as the thresholds. For example, when P=0.01, 4 sites are analyzed (380 x 0.01,
rounded) in each trial. The site with the second largest (positive value) log-transformed
rate difference and the site with the second smallest (negative value) difference are
recorded for each trial. Therefore, a total of 100 positive and 100 negative rate differences
are recorded. The means for the positive and negative numbers were calculated (Table 3-1),
and these means can now be used as the cutoffs to analyze the elongation factor data set
(Figure 3-5). A discussion of these results will be given in the next chapter.

Table 3-1. Statistical cutoffs based on the means for the 100 simulated trials
(log-transformed bacterial rate minus log-transformed eukaryotic rate).

Figure 3-5. Significant sites according to different statistical cutoffs [rates in the second
column are calculated as Ln(Bac)-Ln(Euk)]. For purposes of space, sites 301 and 351 are
not shown but are highlighted when P=0.042. Sites are color-coded by their P values.

Large  rate  differences  are  more  likely  for  rapidly  rather  than  slowly  evolving  sites. 
One  primary  implication  of  this  interpretation  is  that  the  rate  differences  of  the  rapidly 
versus  slowly  evolving  sites  are  not  randomly  distributed  in  their  simulated  frequency 
histograms.  Instead,  the  rate  differences  for  the  former  are  more  heavily  represented  in 
the  tails  of  their  distributions  than  are  those  of  the  latter.  As  a  consequence,  statistical 
thresholds  (e.g.,  >2  SD  in  previous  chapter)  cannot  be  directly  obtained  from  the  tails  of 
the non-transformed frequency distribution. Nevertheless, the log-transformation of this
distribution resolves the biases associated with the correlation between large variances and
large rates. Thus, by averaging across their 100 replicates, the mean minimums and
maximums  of  these  simulated  frequency  distributions  establish  valid  cutoffs  for  the 
identification  of  covarion  sites. 
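
The cutoff procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for the analyses: the function name `mean_cutoffs`, the rounding rule for the number of sites per tail, and the synthetic normal draws standing in for simulated log-transformed rate differences are all hypothetical.

```python
import random
from statistics import mean

def mean_cutoffs(trials, p):
    """Return (mean lower cutoff, mean upper cutoff) across simulated trials.

    trials: list of lists; each inner list holds the log-transformed rate
            differences (bacterial minus eukaryotic) for one simulated trial.
    p:      two-tailed P value; about n_sites * p sites are split between tails.
    """
    lows, highs = [], []
    for diffs in trials:
        k = max(1, round(len(diffs) * p / 2))  # sites per tail (assumed rule)
        ordered = sorted(diffs)
        lows.append(ordered[k - 1])   # k-th smallest (negative tail)
        highs.append(ordered[-k])     # k-th largest (positive tail)
    return mean(lows), mean(highs)

# Synthetic stand-in for 100 simulated trials of 380 sites each (assumption).
rng = random.Random(0)
trials = [[rng.gauss(0.0, 1.0) for _ in range(380)] for _ in range(100)]
low, high = mean_cutoffs(trials, p=0.01)
print(low, high)
```

With 380 sites and P=0.01 the rule above selects the second smallest and second largest value in each trial, matching the example given in the text; real input would be the rate differences from the parametric simulations rather than normal draws.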


CHAPTER  4 

CURRENT STATUS AND FUTURE PROSPECTS USING THE COVARION
MODEL: BAYESIAN INFERENCE


Background 

Genomic sequencing projects have provided scientists with an abundance of
information to predict protein function (Bork and Koonin, 1998; Eisenberg et al., 2000).
Interpreting  the  information  on  a  case-by-case  basis  and  placing  it  within  a  biological 
context  is  a  daunting  task.  Imagine  biochemists  and  molecular  biologists  analyzing  every 
gene  from  every  sequencing  project.  Computational  biology  enables  us  to  manage  this 
information  by  making  fundamental  assumptions  regarding  the  evolutionary  process. 
The  major  annotation  assumption  is  that  orthologous  sequences  have  the  same  function. 
Therefore,  if  we  know  the  function  of  some  protein  in  species  A  and  subsequently 
sequence  a  gene  from  species  B,  and  given  an  arbitrary  statistical  cutoff  of  sequence 
similarity  between  the  two  genes,  then  the  protein  in  species  B  is  said  to  have  the  same 
function  as  in  species  A.  Recent  studies,  however,  suggest  that  this  underlying 
assumption  may  not  hold  true  for  many  cases  of  orthologous  sequences  (see  next 
chapter).  At  the  core  of  these  studies  lies  the  notion  that  functional  importance  is  highly 
correlated  with  conserved  evolutionary  sequence  patterns.  Although  laboratory 
experiments  are  a  necessary  component  to  ultimately  determine  function,  the  goal  of 
computational  biology  is  to  elucidate  the  function  of  genomic  sequences  as  much  as 
possible without having to perform these experiments. Functional genomics then seeks
to form a bridge between computational and experimental biology. Combining
evolutionary and structural analyses with biochemical data enables the functional
genomics field to make predictions regarding functional divergence. These predictions
lead to direct testing of individual residues in the laboratory. Two approaches have
recently been utilized that more accurately represent the underlying modes of divergent
sequence evolution, and are therefore able to detect shifts along the functional continuum
(nonsynonymous/synonymous ratios and covarion approaches). Due to their ability to
detect changes in the patterns of sequence evolution, we predict that these approaches
will become integral components of functional genomics studies in the future.

Evolutionary Tools for Functional Genomics

The most common way to detect functional divergence in genomic sequences is to
calculate the nonsynonymous/synonymous rate ratio (Ka/Ks, dN/dS, or ω) (Yang and
Bielawski, 2000). If a pair of sequences evolved under a neutral model of evolution, then
a comparison of the sequences will yield a nonsynonymous/synonymous ratio of 1 (e.g.,
pseudogenes). A pair of sequences under purifying selection will display a ratio less than
1, whereas sequences under diversifying selection display a ratio greater than 1. Until
recently, methods calculated this ratio across all sites, and therefore positive selection
could only be detected if the average ratio across all sites was greater than 1. This is a
serious limitation: how can we detect positive selection within a
background of otherwise purifying selection? Yang and colleagues have generated a
method for detecting and identifying positive selection at individual sites. A Bayesian
approach is used to calculate the probability that a site is under diversifying selection.
Although this approach is powerful for identifying individual sites that are responsible for
functional divergence and can subsequently be tested in the laboratory, the approach is not
applicable to older evolutionary events because DNA becomes mutationally saturated at the
third codon position more rapidly than at the first and second positions. This sequence
saturation will therefore lead to biases in the estimation of ω, compromising the ability
to detect functional divergence.

The  covarion  (covariotide)  model  utilizes  amino  acid  (DNA)  data  and  can  also  be 
used  to  detect  functional  divergence  among  sequences.  Originally  proposed  by  Fitch  and 
colleagues,  the  non-stationary  covarion  hypothesis  allows  the  replacement  rate  at  an 
individual  position  to  vary  in  different  branches  of  the  evolutionary  tree  (Fitch  and 
Markowitz,  1970;  Miyamoto  and  Fitch,  1995).  A  site  may  be  variable  in  one  lineage  but 
invariable  in  another  lineage.  This  is  in  contrast  to  the  stationary  gamma  model  (Yang, 
1996).  Under  this  model  the  replacement  rate  can  vary  among  positions  within  a 
sequence,  like  the  covarion  model,  but  the  rate  at  any  individual  position  must  remain 
constant across all lineages. Previous studies have demonstrated that the covarion model
can more accurately describe the mode of evolution for some proteins as compared to the
gamma model (Lockhart et al., 1998; Lopez et al., 1999; Gaucher et al., 2001). Although
the covarion hypothesis was conceived nearly 30 years ago, a literature search will show
that the majority of papers discussing it have been published within the past few years
(Barbrook et al., 1998; Finkelstein et al., 1998; Tuffley and Steel, 1998; Gu, 1999;
Lockhart et al., 1999; Moreira et al., 1999; Penny et al., 1999; Philippe and Forterre,
1999; Collins et al., 2000; Fisher et al., 2000; Lockhart et al., 2000; Naylor and Gerstein,
2000; Philippe and Germot, 2000; Philippe et al., 2000; Steel et al., 2000; Marin et al.,
2001). This is due in part to enhanced phylogenetic methods, computational speed, and a
greater need to understand molecular evolution in light of mass-sequencing projects.
Fundamental to this understanding is the idea that shifts in sequence variability patterns
can be indicative of shifts in protein function. Since the covarion model can detect these
shifts in sequence variability, the model holds considerable promise for detecting
functional divergence among various lineages. The remainder of this chapter will discuss
recent advances using the covarion model and its utility for functional genomics.

Covarion Approaches: Methods of Overall Sequence Comparisons

Establishing  a  theoretical  framework  is  an  important  first  step  when  attempting  to 
develop  a  method  based  on  a  proposed  model.  The  covarion  model  is  no  exception. 
Tuffley  and  Steel  have  recently  demonstrated  this  framework  for  reconstructing 
phylogenies  (Tuffley  and  Steel,  1998).  These  authors  show  that  the  gamma  and  covarion 
models  can  be  distinguished  under  specific  conditions.  Their  results  are  based  on  the 
formulation  of  a  distance  measure  that  is  tree  additive  under  certain  conditions  using  the 
covarion  model  [Kimura  three-substitution  (K3ST),  Kimura  two  parameter  (K2P)  and 
Jukes-Cantor  (JC)]  but  is  not  tree  additive  under  the  gamma  model.  Thus  the  covarion 
method  can  generate  information  from  sequences  that  can  ultimately  enable  a  tree 
topology  to  be  recovered  quickly  and  uniquely. 

Lockhart and colleagues have developed statistical tests that can determine if a set of
sequences has evolved according to a gamma model or a covarion model (Lockhart et al.,
1998; Lockhart et al., 2000). These statistics, contingency and inequality tests, were
applied to data sets containing 16S rDNA and elongation factor (EF) sequences to
elucidate the relationship of oxygenic photosynthetic lineages. For both data sets, the test
statistics showed that the covarion model more accurately depicted the evolution of these
sequences as a whole compared to the gamma model. To determine the effects of
covarion behavior on tree topology, the authors phylogenetically analyzed the 16S rDNA
and EF sequences after sequentially removing the sites that displayed covarion behavior,
or functional divergence (invariable in one lineage while variable in the other). This
analysis revealed that the majority of bootstrap support for the original topology was in
fact dependent on covarion sites. Therefore, sites displaying non-stationary evolutionary
patterns produced the most signal in support of the robust topology.

In  a  similar  study  concerned  with  phylogeny  reconstructions,  Philippe  and  colleagues 
analyzed  the  root  of  the  Tree  of  Life  in  light  of  the  covarion  model  using  EFs  (Lopez  et 
al.,  1999).  The  root  of  the  Tree  of  Life  (or  Universal  Tree)  is  determined  through  the 
connectivity  of  two  generated  trees  using  anciently  duplicated  genes.  Previous  studies 
using  EFs  have  suggested  that  the  root  lies  on  the  branch  separating  bacteria  from  archaea 
and eukaryotes. In light of the fact that EFs are highly conserved but mutationally
saturated at variable positions, Lopez et al. reanalyzed the placement of the root. These
authors developed a method that utilizes parsimony to calculate the number of
substitutions at each site and compares these estimates to a given threshold to increase the
signal-to-noise ratio. A matrix was thus generated which contains information regarding
the site-by-site relationships to the threshold. Incorporating this matrix into the tree
building process resulted in a Universal Tree rooted on the branch separating eukaryotes
from archaea and bacteria, albeit not robustly. These authors showed that the covarion
model, when implemented in parsimony analyses, provides a better explanation of the
evolutionary process for EFs than the gamma model. Both of the above studies suggest
that shifts in evolutionary rates influence the resolved topologies. Since evolutionary
rates can be correlated to function, both studies suggest episodes of functional divergence
have taken place within their evolutionary trees. Future work will need to determine
when covarion behavior is phylogenetically informative and when it is misleading.

Covarion Approaches: Non-Bayesian-based Methods for Identifying Sites

Although the above examples prove interesting for phylogeny reconstructions by
analyzing sequences as a whole, studies that provide detailed information regarding
individual sites within a covarion framework have proven more applicable to the field of
functional genomics. Along these lines, Philippe and colleagues studied the relationships
of eukaryotic lineages using EFs (Moreira et al., 1999). By analyzing the site-by-site rate
variation as estimated by parsimony, Moreira et al. showed that ciliates contain the
largest number of variable positions for all eukaryotic lineages studied. In contrast,
ciliates do not show such high variability in α-tubulin or SSU rRNA when compared to
other lineages. The distribution of the sites displaying high variability in EFs was
mapped along the primary sequence structure to identify whether any known protein
motifs were affected by the increased variability. Motifs involved in translational
activity, such as GDP-, GTP-, and aminoacyl-tRNA-binding, were as conserved in
ciliates as in other lineages. However, sites displaying high variability unique to ciliates
were concentrated in regions that are putative actin-binding motifs. This is consistent
with actin being a quantitatively minor protein and having an accelerated mutation rate in
ciliates (cf. Moreira et al., 1999). Future work will determine if the high variability in
these two molecules is due to a coevolutionary acceleration and/or a loss of their
interactions.

The above study highlights the importance of incorporating the covarion model to
analyze functional divergence. The authors were able to hypothesize that ciliates display
uncharacteristically rapid sequence evolution in regions of EF that may be responsible for
protein-protein interactions. Being able to assign individual sites as covarion-like
afforded these authors the opportunity to attempt to correlate functional differences
among eukaryotic lineages to specific sites in the EF sequence.

Incorporating  maximum  likelihood  (ML)  methods  in  phylogenetics  can  be 
advantageous  when  analyzing  complex  modes  of  evolution  (Felsenstein,  1981).  This  is 
especially  true  when  trying  to  identify  specific  sites  of  interest  across  diverse  lineages  for 
direct  testing  in  the  laboratory  to  determine  functional  divergence.  As  previously 
discussed,  comparing  the  number  of  substitutions  site-by-site  between  two  lineages  is 
necessary  for  determining  covarion  behavior  within  a  functional  genomics  framework. 
Therefore,  it  is  important  to  accurately  estimate  the  number  of  expected  substitutions  for 
a  site  along  a  given  topology.  Parsimony  may  underestimate  this  parameter  when  the 
sequences  have  undergone  parallel  and/or  back  mutations.  If  the  correct  model  is 
incorporated,  ML  is  less  susceptible  to  these  confounding  factors  because  the  method 
incorporates  branch  lengths  and  an  explicit  substitution  matrix  into  its  calculations.  In 
addition,  incorporating  the  gamma  distribution  can  provide  a  more  accurate  estimate  of 
the  site-by-site  substitution  rates. 

The gamma model may be considered a subset of the covarion model. Both models
allow rates to vary among sites. However, the gamma model requires that those rates
remain the same (stationary) throughout evolutionary time, whereas the covarion model
allows rates to fluctuate at individual positions between lineages throughout time
(non-stationary). Gu has mathematically demonstrated that if two lineages exhibit functional
divergence, and if the functional divergence is ignored, then the estimation of the gamma
distribution's shape parameter, alpha (α), is biased (Gu, 1999).

Along these lines, we demonstrated that a covarion model explains the evolution of
EFs between bacteria and eukaryotes more accurately than the gamma model (Gaucher et
al., 2001). Using ML methods, the estimated alpha values were 0.48 and 0.36 for the
bacterial and eukaryotic groups, respectively. However, when the groups were combined,
the alpha value increased to 0.78. This shows that the evolutionary process for EFs is not
stationary: had it been stationary, the combined alpha value would have been similar to
the values obtained for the groups individually. Parametric bootstrapping simulations and
sub-sampling experiments indicated that the alpha values were extremely robust to
fluctuations. The non-stationary behavior of alpha for this data set indicated the
importance of invoking the covarion model to explain the evolution of EFs.
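
A toy calculation can illustrate why pooling lineages with non-stationary rates tends to inflate alpha. The sketch below is not the ML estimation used in this work: it assumes that site rates are drawn independently in each lineage (an extreme covarion scenario), uses a shape of 0.4 chosen only for illustration, and estimates alpha by the method of moments (mean squared over variance) directly from the rates. The helper `moment_alpha` is hypothetical.

```python
import random
from statistics import mean, pvariance

def moment_alpha(rates):
    """Method-of-moments estimate of the gamma shape: mean^2 / variance."""
    return mean(rates) ** 2 / pvariance(rates)

rng = random.Random(1)
shape = 0.4      # illustrative shape parameter (assumption)
n_sites = 20000  # large, so the moment estimates are stable

# Independent gamma rates (mean 1.0) for each site in each lineage.
bac = [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]
euk = [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]

# Treating both lineages as one stationary process averages the two rates.
pooled = [(b + e) / 2.0 for b, e in zip(bac, euk)]

alpha_bac = moment_alpha(bac)
alpha_euk = moment_alpha(euk)
alpha_pooled = moment_alpha(pooled)
print(alpha_bac, alpha_euk, alpha_pooled)
```

Because averaging two independent rates halves their variance while leaving the mean unchanged, the pooled estimate is roughly twice the within-lineage value in this toy setting. Real data fall between this extreme and the fully stationary case, in which the pooled and within-lineage estimates would agree.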

As mentioned above, these types of covarion studies become powerful for functional
genomics when individual sites are highlighted as displaying the most prominent
covarion behavior and subsequently correlated to known functional/structural differences
between the lineages under study. We used a histogram approach to identify covarion
sites in EFs. The site-by-site replacement rates for bacteria were estimated by ML and
compared to the site-by-site replacement rates for eukaryotes. A histogram was
generated for the site-by-site rate differences between the two groups. The histogram
was leptokurtic: the region around the mean and the tails were over-represented, while the
shoulders were under-represented, as compared to the expected distribution. Sites evolving
under a stationary process had rate differences centered around the mean of zero. Sites evolving
according to an extreme non-stationary process had rate differences in the tails of the
distribution. A total of 36 sites, out of 380, were highlighted as having a rate difference
of >2 SD. The 36 sites were mapped onto the three-dimensional structures of EFs. This
enabled us to generate testable hypotheses regarding known and putative
functional/structural differences for EFs between bacteria and eukaryotes in the GDP-,
GTP-, aminoacyl-tRNA-, and actin-binding regions (see Chapter 2). A functional
genomics analysis was thus realized by correlating experimental and computational
results.
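
The histogram criterion can be sketched as follows. This is a minimal illustration, assuming that the cutoff is measured from the mean of the observed rate differences; the function `flag_by_sd` and the ten per-site rates are hypothetical and are not values from the EF data set.

```python
from statistics import mean, pstdev

def flag_by_sd(bac_rates, euk_rates, n_sd=2.0):
    """Return indices of sites whose rate difference lies > n_sd SDs from the mean."""
    diffs = [b - e for b, e in zip(bac_rates, euk_rates)]
    mu, sd = mean(diffs), pstdev(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - mu) > n_sd * sd]

# Hypothetical per-site replacement rates (not values from the EF data set).
bac = [0.1, 0.2, 0.0, 1.1, 0.3, 0.2, 4.0, 0.1, 0.4, 0.2]
euk = [0.2, 0.1, 0.1, 1.0, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1]
print(flag_by_sd(bac, euk))  # only the site at index 6 exceeds 2 SD
```

Only the site that is fast in one lineage and slow in the other is flagged; sites with similar rates in both lineages, whether slow or fast, stay near the center of the distribution.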

Covarion Approaches: Bayesian-based Methods for Identifying Sites

Gu (1999) has recently developed a method for detecting shifts in functional constraints
between two or more gene clusters (lineages) as calculated from the expected number of
replacements at individual sites. The model calculates the posterior probability for any
site being in a state of functional constraint or functional divergence. The Bayesian
approach has recently gained attention within the field of molecular evolution for its
unique ability to calculate confidence intervals around estimated parameters, given a data
set (Yang and Rannala, 1997; Larget and Simon, 1999; Lewis, 2001). These probabilistic
statements cannot be generated by other likelihood methods that optimize the function by
calculating the probability of observing the data set, given the parameters. It is assumed
that when a site is invariable in one lineage but highly variable in a different lineage, the
site has undergone functional divergence in one of the lineages compared to the ancestral
state. The coefficient of functional divergence, θ, is calculated according to a maximum
likelihood model that incorporates rate heterogeneity among sites and suggests whether a
significant number of sites have altered functional constraints between the analyzed gene
clusters.

When the θ value indicates functional divergence (a value significantly greater than
zero), a site-specific profile using a hidden Markov model (Bayesian) is generated to
predict which sites are statistically most likely to be responsible for the divergence. This
Bayesian-based approach calculates the posterior probability of a site being in a state of
functional constraint (F0) or functional divergence (F1):

\[ \Pr(B \mid A) = \frac{\Pr(A \mid B)\,\Pr(B)}{\Pr(A)} \]

Within the Bayesian framework, A represents the data while B represents a
hypothesis (or state), and the terms are defined as follows:

Pr(B|A), the posterior probability, is the probability of a site being in state F0 or F1,
given the data.

Pr(B), the prior probability, is the unconditional probability of the hypothesis,
represented by θ in this case.

Pr(A|B), the likelihood, is the probability of the data, given the hypothesis.

Pr(A), the unconditional probability of the data, is used to standardize all possible
probabilities so that the Pr(B|A) for all alternatives sum to 1. Thus, the probabilities of F0
and F1 sum to 1 for all individual sites.
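
The normalization described above can be worked through numerically for a single site. The sketch below is a generic two-state application of Bayes' rule, not Gu's implementation; the function `posterior_f1`, the prior, and the two likelihood values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def posterior_f1(prior_f1, lik_f0, lik_f1):
    """Posterior probability of functional divergence (F1) at one site.

    prior_f1: Pr(F1), the prior probability of divergence.
    lik_f0:   Pr(data | F0), likelihood under functional constraint.
    lik_f1:   Pr(data | F1), likelihood under functional divergence.
    """
    joint_f0 = (1.0 - prior_f1) * lik_f0
    joint_f1 = prior_f1 * lik_f1
    evidence = joint_f0 + joint_f1  # Pr(data), the normalizing constant
    return joint_f1 / evidence

# Hypothetical prior and likelihoods for a single site.
p_f1 = posterior_f1(prior_f1=0.5, lik_f0=0.002, lik_f1=0.018)
p_f0 = 1.0 - p_f1
print(p_f1, p_f0)
```

Here the data are nine times more likely under F1 than under F0 and the prior is even, so the posterior probability of F1 is 0.009 / (0.001 + 0.009), or about 0.9, and the two posteriors sum to 1.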

Incorporating  this  method,  Gu  has  predicted  that  functional  divergence  has  taken 
place  between  mammalian  transferrins  and  lactotransferrins,  mammalian  and  non- 
mammalian  transferrins,  and  between  all  combinations  of  N-,  C-,  and  L-myc  genes.  In 
addition,  the  method  has  predicted  which  residues  are  most  likely  responsible  for  this 
functional divergence.

This approach contains formulations that can be advantageous over other methods.
Calculating the posterior probability eliminates some of the problems that arise when
site-by-site replacement differences between two lineages are treated as linear. For
example, suppose a site has 0 replacements in one lineage
and 6 replacements in another lineage, while a different site has 9 replacements in the
first lineage and 15 in the other. Both sites have a difference of 6 replacements between
the two lineages; however, only the former site should be highlighted as undergoing
functional divergence. This is because the former site shifts between invariable and
variable, while the latter site is highly variable in both lineages. The variances
associated with the estimated expected replacements in both lineages for the latter site are
more likely to overlap and therefore negate the significance of 6 replacements.
Accounting for this non-linear relationship of replacement differences greatly enhances
the ability to detect covarion behavior as measured by functional divergence (see
previous chapter).
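
The same two sites can be used to illustrate the non-linearity with a log transformation, the device adopted in the previous chapter. This is not Gu's posterior calculation; it is only a sketch, and the pseudocount of 0.5 used to avoid taking the logarithm of zero is an assumption, as is the helper name `log_rate_difference`.

```python
import math

def log_rate_difference(a, b, pseudocount=0.5):
    """Ln(a + c) - Ln(b + c); the pseudocount c avoids Ln(0) (assumed value)."""
    return math.log(a + pseudocount) - math.log(b + pseudocount)

# Replacement counts for the two sites in the example above.
sites = {"site 1": (0, 6), "site 2": (9, 15)}
for name, (first, second) in sites.items():
    raw = first - second                      # identical for both sites
    logged = log_rate_difference(first, second)
    print(name, raw, round(logged, 3))
```

Both sites share the same raw difference, but the log-transformed difference is several times larger in magnitude for the site that shifts from invariable to variable, which is the behavior a cutoff on the log scale is meant to reward.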

We have applied Gu's method to analyze our EF data set (Figure 4-1). The θ value
was 0.71 ± 0.03, which is significantly greater than 0 and indicates that functional
divergence has taken place between bacterial and eukaryotic EFs. The site-specific profile
highlighted 49 sites as having a posterior probability of 95% or greater and thus being in
a state of functional divergence between the two lineages. Twenty-eight of these sites
overlap with the sites highlighted by the original histogram method. A total of 24 sites
were evolving more rapidly in eukaryotes than in bacteria. The majority of these sites are
on the surface of the protein and are found in known protein- and nucleotide-binding
domains. Alternatively, 25 sites were evolving more rapidly in bacteria than in eukaryotes.
Eight of these sites lie in known binding domains, while 15 sites lie on the surface in
areas with no known function. The majority of these sites do, however, lie in putative
actin- and ribosome-binding domains. The increased evolutionary rate of these sites is
consistent with actin being absent in bacteria and with the structural differences of the
bacterial and eukaryotic ribosomes. Therefore, this approach of combining
computational and experimental biology has predictive value regarding functional
divergence.

In light of Gu's method being the most statistically sophisticated of all the current
programs available for detecting functional divergence, we asked whether our
log-transformed approach is comparable to his approach. We chose the threshold values that
corresponded to P=0.016 from the simulations (see Table 3-1). Figure 4-2 compares the
results using the original rate differences, Gu's method, and log-transformed rate
differences for the elongation factor data set. Both the log-transformed and Gu analyses
highlight 49 sites as being the most statistically significant diverging positions between
bacteria and eukaryotes using the arbitrary cutoffs of 98.4% and 95%, respectively.
Surprisingly, thirty-seven of these 49 sites overlap. The ability of the log-transformed
method to give results similar to Gu's method can be further analyzed. The importance
of log-transformed rate differences is shown, for example, at positions 141 (row 55) and
288 (row 8). Position 288 is invariable in bacteria but moderately evolving in eukaryotes.
It can therefore be inferred that this site has undergone functional divergence because
the selective constraints acting on this site have shifted. Because the
original rate difference at this position is not very large, our original approach was not
able to highlight this site. However, both the Gu and log-transformed approaches
highlight this site as functionally divergent between the two lineages. Alternatively,
position 141 is moderately variable in both bacteria and eukaryotes. Since the two lineages
lie at opposite ends of the 'moderately variable' spectrum at this site, our original
approach highlighted this site because of the large rate difference. Both the Gu and
log-transformed methods agree in not highlighting this site. Thus these two approaches are
comparable and outperform less-sophisticated methods for detecting functional divergence.

Figure 4-1. Functional divergence of bacterial and eukaryotic EFs. (a) Tertiary structure
of the GTP-bound state for EF-Tu from T. aquaticus (Song et al., 1999). Green and red
highlight those 49 sites that have posterior probabilities greater than or equal to 95%
according to Gu's method and are evolving faster in bacteria than in eukaryotes, and vice
versa, respectively. (b) The known and putative behavioral roles of the sites highlighted
in part (a). The majority of these sites can be implicated in the functional and structural
differences between bacterial and eukaryotic EFs in terms of translational and
actin-binding activities.


Figure  4-2.  Comparison  of  three  different  methods  for  highlighting  functionally 
divergent  sites.  The  methods  include  our  original  approach  for  calculating  rate 
differences,  Gu's  approach,  and  our  log-transformed  rate  differences  approach.  The 
columns  are  represented  by  the  following: 

A)  The  row  number  of  this  table. 

B)  The  site  number  corresponding  to  the  EF  multiple  sequence  alignment. 

C) The expected number of amino acid replacements for the bacterial lineage using
Gu's program. This number is different from, but similar to, the expected replacement
rates that we calculate using our approach.

D)  The  expected  number  of  amino  acid  replacements  for  the  eukaryotic  lineage. 

E)  The  posterior  probabilities  according  to  Gu's  method. 

F)  Our  original  rate  differences  between  the  bacterial  and  eukaryotic  lineages. 

G)  Sites  highlighted  in  our  original  analysis. 

H)  Sites  highlighted  as  having  a  posterior  probability  greater  than  95%. 

I)    Sites  highlighted  using  the  P=0.016  statistical  thresholds  generated  from  the 

simulated  log-transformed  rate  differences. 
J) The log-transformed rate differences between bacterial and eukaryotic EFs.

[Table: the cell values were scrambled during text extraction and cannot be reliably
reassigned to rows and columns. Summary entries, in extracted order: Ln(bac)-Ln(euk);
Mean; -0.03175; -3.07; 95%; -2.2829.]

In all, we believe the above analyses demonstrate the well-rounded initiation of a
functional genomics study. Evolutionarily divergent sequence patterns were highlighted
and mapped onto the three-dimensional structure of EF and subsequently correlated with
known biochemical differences between the two lineages under study. The successful
completion of a functional genomics study will now require site-directed mutagenesis
studies to determine whether the sites displaying divergent evolutionary patterns are
responsible for some or all of the biochemical and behavioral differences between the EFs.

Predictive  Power  of  Covarion  Analyses 

Based on Figure 4-1, we would predict that the binding mechanisms of nucleotide
exchange factors to their respective EFs are not equivalent in eukaryotes and bacteria.
This hypothesis rests on the observation that eukaryotes display rapid replacement
rates at many positions that are conserved in bacteria, and these positions have been
shown to bind the nucleotide exchange factor in the latter lineage.
Recent experimental evidence supports our hypothesis. The crystal structure of
yeast EF-1α bound to its nucleotide exchange factor has been determined at
1.67 Å resolution (Andersen et al., 2000).

The overall structure of eukaryotic EF-1α is very similar to the bacterial EF-Tu, as
predicted based on the high degree of sequence conservation between the two lineages.
However, unforeseen by experimental biologists but predicted by our comparative
evolutionary approach, the eukaryotic nucleotide exchange factor does not bind the same
EF regions as demonstrated for EF-Tu:EF-Ts. Figure 4-3 shows that EF-1α binds its
nucleotide exchange factor predominantly at the interface of domains 1 and 2.
Surprisingly, this is the same region that binds tRNA. In fact, many of the residues in
EF-1α that bind tRNA also bind the nucleotide exchange factor. Further investigation
shows that three EF positions are slowly evolving in eukaryotes but rapidly evolving in
bacteria (covarion behavior) and are in regions that form hydrogen bonds between the
exchange factor and EF-1α domain 2. As EF-1α has been shown to interact with tRNA
synthetases via the exchange factor, along with the fact that eukaryotic tRNA is not free
in solution and EF-1α can bind actin, these results support the idea that eukaryotic EFs
are involved in 'tRNA-channeling' from the nucleus to the ribosome. This functional
annotation cannot be applied to EF-Tu since bacteria lack actin and a nucleus.

[Figure 4-3 image; recoverable labels: 33-39, aa-tRNA]

Figure 4-3. Crystal structure of EF-Tu with substrate binding domains labeled. The
binding domain of nucleotide exchange factor and EF-1α is highlighted in transparent
blue. Note that the binding domains for the nucleotide exchange factor differ
significantly between the bacterial EF-Tu and eukaryotic EF-1α.

In addition to the above example, the recent crystal structure data suggest that EF-1α
does not undergo major conformational shifts between the GTP- and GDP-bound states,
as seen in Figure 2-6. This was predicted based on experimental data (Negrutskii and
El'skaya, 1998) and by our analysis. Residues 33-39 are rapidly evolving in bacteria and
are thought to act as a hinge, thereby allowing domain 1 to re-orient itself relative to
domains 2 and 3 between the GTP- and GDP-bound states in bacteria. In fact, the crystal
structure of this stretch of residues has never been elucidated due to high resonance, and
it therefore does not display a characteristic secondary structure such as an alpha helix
or beta strand in EF-Tu. This stretch of residues is conserved in EF-1α and is an alpha
helix, as predicted by us (cf. Gaucher et al., 2001). Using EF-Tu:GTP as a reference
point, domain 1 is rotated by 25° relative to domains 2 and 3 in EF-1α bound to the
nucleotide exchange factor, and by 83° in EF-Tu:GDP and EF-Tu:EF-Ts (Kawashima et al.,
1996; Andersen et al., 2000). This large degree of rotation prevents EF-Tu from binding
tRNA in the GDP-bound state. We can assume the structure of EF-1α:GTP is similar to
the structure of EF-1α bound to the exchange factor, and thus similar to EF-1α:GDP,
because EF-1α can bind to tRNA in both the GTP- and GDP-bound states (Negrutskii
and El'skaya, 1998). It would therefore appear that our prediction concerning the
lack of major conformational shifts in eukaryotic EF-1α is correct. We suggest that it
might be possible to synthesize an organic molecule that binds in the large pocket unique
to the EF-Tu:GDP state without binding to EF-1α. This may prevent EF-Tu from
recharging and thus act as an antibiotic.

These  recent  data  strongly  support  some  of  the  hypotheses  previously  generated 
(Gaucher  et  al.,  2001).  Thus,  incorporation  of  the  covarion  model  into  comparative 
evolutionary  studies  holds  considerable  promise  for  detecting  functional  divergence 
among  proteins  at  the  level  of  sequence  analysis. 

Covarion  Approaches,  Evolutionary  Tools,  and  Functional  Genomics 

Functional divergence can be detected by changes in evolutionary rates (Messier and
Stewart, 1997; Golding and Dean, 1998; Chang and Donoghue, 2000). For the covarion
model, functional divergence is best detected when a site is conserved in one lineage but
highly variable in another. This means that the site has gained a functional role in
one lineage, or lost its role in the other, compared to the ancestral state. However, the
covarion model cannot detect all episodes of functional divergence. Suppose a site has a
specific function and is highly conserved in the ancestral state, one descendant lineage
retains that function, and another descendant lineage loses it only to gain a new
function later. The covarion approach cannot identify this change in functional behavior
because the site is now conserved in both lineages. This behavior
is exemplified by the EF results. The large helix in the upper right-hand corner of domain
1 in EF-Tu binds the nucleotide exchange factor EF-Ts in bacteria, whereas EF-1α does
not bind its exchange factor here (Figure 4-3). Based on sequence similarity, it is
hypothesized that this region in EF-1α binds actin (Yang et al., 1990). Thus, these
positions are conserved in both lineages (invariant but different), yet they perform
different functions. Although limited in this way, the covarion model holds considerable
promise for future studies. In addition, the non-stationary behavior of parameters in
phylogenetic analyses is beginning to gain considerable attention. Whether analyzing
codon usage (Sharp and Matassi, 1994), evolutionary distance (Gu and Li, 1998), rate
variation (Grishin et al., 2000; Morozov et al., 2000), or base composition (Mooers and
Holmes, 2000), non-stationary models are proving to be more accurate at representing the
dynamic evolutionary process than stationary versions.
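
The distinction drawn above, between sites that are conserved in one lineage but variable in the other and sites that are conserved in both lineages yet carry different residues, can be illustrated with a minimal sketch. This is not the statistical procedure used in this work. It assumes, purely for illustration, that a column with a single observed residue counts as "conserved" and any other column as "variable"; the function names and the toy sequences are hypothetical.

```python
from typing import List, Set


def column(seqs: List[str], i: int) -> Set[str]:
    """Return the set of residues observed at alignment column i."""
    return {s[i] for s in seqs}


def classify_site(clade_a: List[str], clade_b: List[str], i: int) -> str:
    """Classify column i by comparing residue variability in two clades.

    Assumption: one observed residue means 'conserved', more than one means
    'variable'. Real analyses estimate per-site rates on a tree instead.
    """
    a, b = column(clade_a, i), column(clade_b, i)
    a_conserved, b_conserved = len(a) == 1, len(b) == 1
    if a_conserved and b_conserved:
        # Conserved in both clades: same residue, or different residues.
        return "conserved_same" if a == b else "invariant_but_different"
    if a_conserved != b_conserved:
        # Conserved in exactly one clade: the pattern a covarion test targets.
        return "covarion_like"
    return "variable_in_both"


# Hypothetical toy alignment (not real EF sequences).
bacteria = ["GKTAL", "GKSAI", "GKTAV"]
eukaryotes = ["GRTLL", "GRTMI", "GRTKL"]
labels = [classify_site(bacteria, eukaryotes, i) for i in range(5)]
print(labels)
```

Under this crude proxy, an "invariant but different" column is not flagged as covarion-like, which is the limitation the paragraph describes.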

The  covarion  approach  now  offers  an  alternative  to  the  nonsynonymous/synonymous 
approach  for  detecting  functional  divergence  in  genomic  sequences.  However,  we  do  not 
view  these  approaches  as  being  mutually  exclusive.  Functional  genomics  analyses  are 
enhanced  when  these  two  approaches  are  used  in  concert.  Both  methods  can  be  reliable 
for  determining  recent  evolutionary  events  resulting  in  functional  divergence  (although 
the covarion model is predicted to work better than the ω ratio for older evolutionary events). As
demonstrated  for  EFs  and  by  Yang,  both  methods  can  detect  functional  divergence  within 
a  background  of  conserved  sequence  evolution.  In  this  review,  we  have  shown  that 
covarion  behavior  can  be  detected  using  multiple  approaches  (statistical  tests,  parsimony, 
and  maximum  likelihood)  and  that  individual  sites  can  be  highlighted  using  multiple 
approaches  (variable/invariable,  histogram,  Markov/Bayesian).  Although  the  covarion 
hypothesis is difficult to model, its incorporation into phylogenetic analyses has furthered
our understanding of molecular evolution. Overall, we advocate the use of evolutionary
approaches  for  determining  protein  function.  Simple  statistical  tests  (e.g.,  BLAST, 
Lipman  and  Pearson,  1985)  fail  to  identify  important  historical  events  that  led  to 
functional  divergence.  Therefore,  functional  genomics  studies  are  at  their  most  powerful 
when  more  realistic  evolutionary  models  are  incorporated  into  the  computational  aspects 
of  the  study.  We  predict  that  functional  genomics  will  utilize  sophisticated,  yet  easily 
manageable,  evolutionary  tools  to  infer  protein  function  in  the  future. 


CHAPTER  5 
EVOLUTION,  LANGUAGE  AND  ANALOGY  IN  FUNCTIONAL  GENOMICS 

Background 

Almost  a  century  ago,  Wittgenstein  pointed  out  that  theory  in  science  is  intricately 
connected  to  language.  This  connection  is  not  a  frequent  topic  in  the  genomics  literature. 
But  a  case  can  be  made  that  functional  genomics  is  today  hindered  by  the  paradoxes  that 
Wittgenstein  identified.  If  this  is  true,  until  these  paradoxes  are  recognized  and  addressed, 
functional genomics will continue to be limited in its ability to extrapolate information
from genomic sequences.

Those  who  ask  "What  is  the  function  of  my  protein?"  expect  a  linguistic  answer 
(Wittgenstein,  1993),  a  sentence  or  two  written  in  the  language  of  the  biologist.  The 
answer  might  take,  as  an  example,  the  form:  "Your  protein  is  a  leptin,  which  regulates  the 
feeding  behavior  of  mice.  When  the  gene  is  mutated  or  deleted,  the  mouse  becomes 
obese"  (Zhang  et  al,  1994). 

How  does  one  get  such  a  linguistic  construct  from  a  genomic  sequence,  which  is  no 
more  (and  no  less)  than  a  chemical  formula  for  an  organic  molecule?  This  question, 
central  to  contemporary  functional  genomics,  is  not  easy  to  answer.  The  simpler  task  of 
predicting  how  the  behavior  (not  function)  of  an  organic  molecule  is  determined  by  its 
structure  remains  one  of  the  great  unsolved  problems  in  chemistry.  In  principle,  we  should 
be  able  to  solve  this  problem.  The  "First  Law  of  Chemistry"  states  that  the  behavior  of  all 
matter  is  determined  by  the  behavior  of  its  constituent  molecules,  even  behavior  that  a 
biologist  might  observe  and  call  a  phenotype.  However,  this  has  not  been  done 
convincingly  for  any  but  the  simplest  of  molecules,  and  we  are  far  from  doing  it  for  the 
general  molecule,  let  alone  a  protein. 
And even if we could do so, behavior would not necessarily lead to a statement about
function. For example, it might become predictable that the benzodiazepine receptor binds
tightly to Valium. But the implied statement, "the purpose of this receptor is to bind to
Valium", is transparently mis-derived because Valium is synthetic. To go from molecular
behavior to organismic fitness, which is the Darwinian definition of function, information
is required about the entire organism and the entire ecosystem.

"Functional  Equivalency" 

To  obtain  functional  annotation,  contemporary  bioinformatics  generally  attempts  to 
bridge chemical sequence to biological fitness using a doctrine of "functional
equivalency" (for example, see Eisenberg et al., 2000). This doctrine seeks to write a
linguistic  construct  for  a  new  protein  sequence  by  expropriating  the  linguistic  construct 
from  another  sequence  having  a  similar  chemical  structure,  under  the  assumption  that  the 
two  proteins  with  similar  chemical  structures  have  equivalent  functions.  A  protein  with 
unknown  function  is  found  in  one  genome.  It  is  inferred,  from  its  sequence  similarity,  to 
be  homologous  to  a  different  protein  found  in  a  different  organism.  Homologous  proteins 
are  then  assumed  to  have  equivalent  functions.  The  functional  language  assigned  to  the 
protein  with  the  known  function  is  then  transferred  to  the  new  protein. 
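
The annotation-transfer procedure just described can be sketched in a few lines. This is an illustration of the doctrine being criticized, not a tool used in this work. The sequences, annotation strings, function names, and the 40% identity cutoff are all hypothetical, and the sequences are assumed to be pre-aligned to equal length.

```python
from typing import Dict, Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    # Count columns where both sequences carry the same residue (gaps excluded).
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def transfer_annotation(query: str,
                        annotated: Dict[str, Tuple[str, str]],
                        cutoff: float = 40.0) -> str:
    """Copy the annotation of the most similar annotated sequence.

    `annotated` maps a name to (aligned sequence, functional annotation).
    The cutoff is a hypothetical threshold chosen only for this sketch.
    """
    best_name: Optional[str] = None
    best_identity = 0.0
    for name, (seq, _) in annotated.items():
        identity = percent_identity(query, seq)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_name is None or best_identity < cutoff:
        return "unknown"
    return annotated[best_name][1]


# Hypothetical toy database and query.
database = {
    "protA": ("MKTAYIAK", "adds H-X to fumarate"),
    "protB": ("MLSDQQRW", "unrelated activity"),
}
print(transfer_annotation("MKTAYLAK", database))
```

The sketch makes the hidden assumption explicit: the returned string depends only on sequence similarity, and nothing in the procedure asks whether the two proteins contribute to fitness in the same way.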

Long before the genomics revolution began, many cases were known where this
doctrine failed (Benner and Ellington, 1988). Figure 5-1 illustrates just one example. Here,
four proteins from microbial metabolism (adenylosuccinate lyase, argininosuccinate lyase,
aspartase, and fumarase) clearly group into homologous pairs based on sequence similarity,
and are part of an evolutionary superfamily that includes all four proteins (Aimi et al.,
1990). However, one protein is involved in nucleic acid biosynthesis, another in amino
acid biosynthesis, another in amino acid degradation, and the last in central
metabolism. The biologist certainly does not regard the functions of
these proteins as equivalent.

Figure 5-1. Using analogy to determine function. Homologous enzymes catalyze four
reactions: (a) in central metabolism (the citric acid cycle), (b) in amino acid degradation,
(c) in nucleic acid biosynthesis, and (d) in amino acid biosynthesis. The enzymes are
indisputably homologous; even a simple sequence search identifies significant similarities.
The colors show the analogy between the three catalyzed reactions from the perspective of
organic chemistry. The functions of the proteins, from their roles in pathways, are quite
different. An annotation strategy that assumes homologous proteins confer fitness in their
host organisms in an analogous way would be misled by this example.

But should they? All of these proteins use fumarate as a substrate. They all, in the
language of the chemist, add the elements of H-X to fumarate using a Michael reaction,
where the carboxylic acid functional group acts as an electron sink. This type of language
is very close to that used by the Enzyme Commission when it assigns "EC" numbers to
enzymes. In the language of the chemist, all of these proteins have analogous functions
because they all catalyze an E2 addition reaction to fumarate. Evolutionary recruitment in
this family presumably occurred because of this mechanistic similarity (Gerlt and Babbitt,
1998).

The  point  to  be  made  here  is  not  that  one  cannot  infer  function  by  homology  alone. 
Nor  do  we  wish  to  argue  that  the  biologist's  view  of  function  is  right,  while  the  Enzyme 
Commission's  view  is  wrong.  Rather,  the  point  to  be  taken  is  that  the  analysis  of  function 
is  tied  to  the  language  used  to  describe  it.  The  language  used  to  describe  the  systems 
determines  whether  one  sees  "equivalency"  or  "non-equivalency". 

Orthologs  as  Functional  Analogs? 

Some  attempts  to  alleviate  these  problems  are  based  on  the  identification  of 
orthologous  sequences.  Here,  the  homology-implies-equivalency  assumption  is  restricted 
to  a  subset  of  homologs  that  diverged  in  the  most  recent  common  ancestor  of  the  species 
sharing  the  homologs.  This  strategy  is  useful,  of  course.  But  it  is  likely  to  be  far  less 
general  than  is  widely  thought.  Two  species  living  in  the  same  space,  almost  by  axiom, 
cannot  have  identical  strategies  for  survival.  This,  in  turn,  implies  that  two  orthologous 
proteins  may  not  contribute  to  fitness  in  exactly  the  same  way  in  two  species. 

Some examples are useful. Leptin is known from genetics to be related
to the obesity phenotype in the mouse. The human homolog, almost certainly the ortholog,
is known, and is a target for drug development as an obesity gene in humans. Some details
of the molecular history, however, suggested that the human protein might not play this
role. A reconstruction of the evolutionary history of the leptin family (Figure 5-2) shows
that as primates emerged from the cenancestor of mouse and human, the leptin gene
underwent an episode of rapid sequence evolution involving many nonsynonymous
substitutions (Benner et al., 1998). Indeed, the reconstructed evolutionary history
(Messier and Stewart, 1997) of the gene family shows that the number of nonsynonymous
changes that accumulated in the gene during this episode, divided by the number of
synonymous changes, normalized for the number of nonsynonymous and synonymous sites
(the Ka/Ks ratio, sometimes referred to as ω, or dN/dS) is remarkably high. In fact, the
Ka/Ks ratio in this episode is higher than that displayed by a pseudogene.
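
The ratio described in words above can be written out as a short calculation. The following is a minimal sketch, not the method used to produce the leptin values: it assumes raw per-site proportions with no correction for multiple substitutions at the same site, the function name is hypothetical, and the counts in the usage lines are invented for illustration.

```python
import math


def ka_ks(nonsyn_subs: float, syn_subs: float,
          nonsyn_sites: float, syn_sites: float) -> float:
    """Per-site nonsynonymous changes divided by per-site synonymous changes.

    Assumption: uncorrected proportions (no multiple-hit correction).
    Returns math.inf when synonymous change is absent but nonsynonymous
    change is present, and math.nan when no change of either kind occurred.
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_subs / nonsyn_sites  # nonsynonymous changes per nonsynonymous site
    ks = syn_subs / syn_sites        # synonymous changes per synonymous site
    if ks == 0:
        return math.inf if ka > 0 else math.nan
    return ka / ks


# Hypothetical counts, not taken from the leptin data.
neutral_like = ka_ks(6, 3, 200, 100)    # equal per-site rates in both classes
excess_nonsyn = ka_ks(12, 3, 200, 100)  # nonsynonymous change in excess
no_silent = ka_ks(5, 0, 200, 100)       # no synonymous change on the branch
print(neutral_like, excess_nonsyn, no_silent)
```

A ratio near 1 is what a sequence free of selective constraint would be expected to show, which is why a value above that of a pseudogene is notable.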

The only explanation consistent with Darwinian theory for this episode is that leptin
was under "positive selection pressure" (Yang and Bielawski, 2000) as it entered the
primate lineage 100 million years ago. Mutant forms of the primitive primate leptin
evidently contributed more to the fitness of the primate descendants than non-mutant
forms of the protein. This suggested, four years ago, that human "leptin" might not play a
role in humans analogous to the role it plays in mice. At the very least, a primate model is
recommended for pharmacological analysis of compounds targeted towards this system.
And now, articles are appearing with titles such as "Whatever happened to leptin?"
(Chicurel, 2000), noting that "the hormone's precise physical role seems to vary from
species to species."

Analogous statements can be made about other pairs of orthologs from mammalian
species (Chandrasekharan et al., 1996; Liberles et al., 2001). Just as we cannot
confidently accept annotation made by homology, we cannot be confident that annotations
based on orthology are correct either.

In fact, "analogy", not "equivalency" nor "non-equivalency", is the topic in these
examples of annotation. Analogy involves selecting some features of a system as being
more important than others, and using these features to make a comparison. The Enzyme
Commission views the structure of the substrate (fumarate) and the nature of the reaction
being catalyzed (E2 addition, for example) as the features worth noting. The biologist (at
least as represented above) considers the pathway as the noteworthy feature. The former is
more likely to be predictable from the molecular formula derived from the genomic
sequence. The latter is closer to the Darwinian concept of fitness.

Figure 5-2. Evolutionary tree for leptins extracted from the Master Catalog. Numbers on
the branches are Ka/Ks ratios, the ratio of nonsynonymous to synonymous changes,
normalized for the number of nonsynonymous and synonymous sites in the gene.
Undefined (∞) means that no silent substitutions occurred on the branch; calculating a
Ka/Ks ratio would require division by zero. Reconstructed evolutionary sequences show
rapid evolution of the leptin gene in primitive primates, consistent only with "positive
selection" (where the lines are red), and implying a different "function" in primate leptins
than in the cenancestral leptins (and rodent leptins). The branch with the Ka/Ks ratio of
1.47 (leading to the apes) contains 7.63 and 2.57 nonsynonymous and synonymous
substitutions, respectively (before normalization), meaning that the high ratio is quite
significant. The branch with the Ka/Ks ratio of 1.24 (leading to the rhesus monkey)
contains 7.68 and 3.61 nonsynonymous and synonymous substitutions. The branch with
the Ka/Ks ratio of 0.21 (leading to the rat/mouse ancestor) contains 14.31 and 31.62
nonsynonymous and synonymous substitutions.

Again, neither view is "right". But the Wittgensteinian view of functional genomics
requires  that  we  understand  the  process  and  language  of  "analogy",  recognize  that  it  is  not 
the  same  as  "equivalency",  and  appreciate  that  an  analogy  is  frequently  more  informative 
about  the  culture  of  the  individual  drawing  the  analogy  than  it  is  about  the  systems 
between  which  the  analogy  is  being  constructed. 

A  Behavioral/Functional  Continuum 

We can expect, almost from first principles, that the near-continuum in molecular
structure available to protein sequences is associated with a near-continuum of molecular
behavior (Hey, 1999). This, in turn, should be associated with a near-continuum in fitness.
Within this continuum, the case can frequently be made that the differences are more
interesting than the similarities, and need to be captured and understood to make a useful
functional annotation.

Consider, for example, the family of elongation factors (EFs) represented by EF-Tu
(in bacteria) and EF-1α (or eEF1A, in eukaryotes). All are annotated in the contemporary
databases as having "the same function". After all, they all present a charged
aminoacyl-tRNA to the ribosome. Closer inspection (Lockhart et al., 1998; Moreira et al.,
1999; Gaucher et al., 2001) shows, however, that the details by which this presentation is
done, and the behaviors of individual EFs in general, are different, in a way that has an
impact on any linguistic description of "function" (Figure 5-3). For example, EF may
function in eukaryotes by binding to uncharged tRNAs in the nucleus, being charged there,
and then being transported to the cytosol via binding to actin. Regardless of the ability of
bacterial EFs to display these behaviors (this is under-examined), the function of EFs in
bacteria cannot contain this language, as bacteria do not have a nucleus.

Figure 5-3. Continuum of elongation factor behavior. Recent studies demonstrate that the
behaviors of various Elongation Factor Tu/1α proteins are different in different members
of  the  family,  and  these  behavioral  differences  are  functionally  significant.  Undoubtedly, 
"participation  in  translation"  is  the  language  describing  one  behavior  almost  certainly 
important  to  function  (fitness)  in  all  of  these.  Specific  features  of  the  behavior  have, 
however,  changed  (even  to  the  point  of  being  gained  or  lost  entirely)  in  the  evolutionary 
episodes  separating  Nodes  1-4.  Functional  divergence  in  the  GTP-,  GDP-,  tRNA-,  and 
actin-binding  domains  has  been  demonstrated  for  the  highly  conserved  EF  protein.  Shifts 
in  EF  behavior  are  found  throughout  the  phylogenetic  tree;  between  bacteria,  archaea,  and 
eukaryotes  (Node  1),  between  ciliates  and  other  eukaryotes  (Node  2),  between  plastid  and 
non-plastid  bacteria  (Node  3),  and  between  photosynthetic  and  non-photosynthetic 
bacteria  (Node  4).  At  the  core  of  these  studies  lies  the  notion  that  functional  importance  is 
highly  correlated  with  conserved  evolutionary  patterns.  For  example,  ciliate  EFs  display 
functional  divergence  in  the  domains  proposed  to  interact  with  actin.  This  is  consistent 
with  actin  being  a  quantitatively  minor  protein  and  having  an  accelerated  mutation  rate  in 
ciliates.  A  conventional  homology-based  search  would  have  simply  suggested  ciliate  EFs 
have  "the  same  function"  as  other  eukaryotic  EFs  due  to  their  high  sequence  identity,  and 
a  substantial  level  of  analogy  in  other  functional  behaviors.  But  it  is  also  clear  that  a  full 
understanding  of  a  protein's  function  requires  an  analysis  of  the  differences,  and  is  best 
realized  when  sequences  are  placed  within  an  historical,  evolutionary  comparative 
framework. 

With EFs, the first level of annotation will undoubtedly reflect the analogy in the
functions of different proteins from different species. At the next level, however, the
annotation must capture the differences. With EFs, the signature of functional change can
also be found in the sequences, when they are viewed with a sufficiently sophisticated
evolutionary model (Gaucher et al., 2001).

A  Way  Forward 
How  might  evolutionary  analyses  be  used  to  generate  linguistic  statements 
concerning  function  for  genomic  sequences?  The  completeness  of  an  organism's  genomic 
sequence  offers  one  advantage;  it  permits  us  to  say  what  is  not  present.  Further,  we  can 
draw  on  classical  descriptions  of  the  history  of  life  known  from  paleontology  and  geology 
to  contrast  with  the  molecular  histories  of  protein  families  reconstructed  from  genomic 
sequence databases. Functional genomics must approach genomic sequences in a
particular way to facilitate this process.

(a) Complete evolutionary models of a protein family (Benner et al., 2000).
Reconstructed  sequences  of  ancient  proteins,  intermediates  in  evolutionary  history  of  a 
protein  family,  need  to  be  added  to  evolutionary  models  that  include  a  multiple  sequence 
alignment  and  an  evolutionary  tree.  These  ancestral  sequences  increase  the  scope  of 
functional  inferences  that  can  be  made  from  reconstructed  evolutionary  biology. 

(b) Higher order analyses of sequence evolution. Today, the evolution of protein
sequences is modeled using simple stochastic mathematics that treats proteins as if they
were formless, functionless strings of letters. These models are poor approximations of
reality. Their usefulness comes from their ability to provide a "null hypothesis". The
differences between how real proteins divergently evolve and how the stochastic models
expect them to evolve produce a signal that is informative about form and function.
Higher order analyses (Thorne et al., 1992) of sequence divergence capture this signal.
These incorporate substitution rates that depend on the site (gamma distribution) (Yang,
1996), non-independence of substitutions at different sites (Olmea et al., 1999), and
higher order gap penalties (Benner et al., 1993). Many examples are now available where
these higher order models support higher levels of sequence interpretation. Site-specific
mutation rates are correlated to functionally important sites on a protein (Gaucher et al.,
2001). Shifts in functional constraints are evident when a specific site is rapidly evolving
in one lineage but slowly evolving in another (covarion behavior) (Miyamoto and Fitch,
1995). Non-independence of sites is used for protein structure prediction (Benner et al.,
1997). We expect these types of analyses to be done routinely in the future (Jermann et
al., 1997; Golding and Dean, 1998; Naylor and Gerstein, 2000; Yang and Bielawski, 2000).

(c)  Improve  the  dating  of  events  in  the  reconstructed  molecular  history  of  the  protein 
family.  Genomics  becomes  especially  powerful  when  events  in  the  reconstructed 
molecular  record  are  correlated  with  events  in  the  geological  and  paleontological  records. 
To  make  this  correlation  requires,  however,  a  molecular  clock.  Amino  acid  sequences 
themselves  are  known  to  be  imprecise  molecular  clocks  (Ayala,  1999).  Metrics  that  use 
synonymous  substitutions  are  frequently  used  to  date  molecular  events.  We  expect  to  see 
new  tools  that  reflect  the  complexities  of  the  mutation  process  to  make  dating  more 
reliable, especially within vertebrate evolution (Peltier et al., 2000). For example, the
organic chemistry and selective mechanisms governing mutations within GC isochores
can lead to spurious estimations when performing phylogenetic analyses. Incorporating
more complex evolutionary models that account for biases in synonymous substitution
rates greatly enhances comparative analyses (Mooers and Holmes, 2000). This, in turn,
will open a new avenue for extracting information about function in an organismic and
ecological context.

(d)  Interpret  sequence  evolution  within  the  context  of  three-dimensional  structures. 
The  three  dimensional  structure  of  the  protein  connects  sequence  to  reactivity. 
Permutations  within  primary  sequences  can  be  correlated  to  those  sites  that  are  responsible 
for  protein-ligand  interactions  and  therefore  differences  in  behavior.  A  three  dimensional 
structure  therefore  adds  significantly  to  any  story  in  molecular  evolution,  and  does  so 
especially  when  complex  phenomena  are  being  analyzed  (Miyamoto  and  Fitch,  1995; 
Golding  and  Dean,  1998;  Gaucher  et  al,  2001). 
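
As a rough illustration of how sequence variation might be screened before it is mapped onto a structure, the sketch below flags alignment columns that are invariant in one clade but variable in another. Per-column Shannon entropy is used here as a crude stand-in for the statistical tests cited in this work, not as a reproduction of them. The function names, the entropy threshold, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

GAP_CHARS = "-. "


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r not in GAP_CHARS]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def rate_shift_candidates(clade_a, clade_b, min_entropy=1.0):
    """Return 1-based columns invariant in one clade but variable in the other."""
    length = len(clade_a[0])
    if any(len(s) != length for s in clade_a + clade_b):
        raise ValueError("all sequences must be aligned to the same length")
    hits = []
    for i in range(length):
        h_a = column_entropy([s[i] for s in clade_a])
        h_b = column_entropy([s[i] for s in clade_b])
        if (h_a == 0.0 and h_b >= min_entropy) or (h_b == 0.0 and h_a >= min_entropy):
            hits.append((i + 1, round(h_a, 3), round(h_b, 3)))
    return hits


# Toy aligned fragments (invented for illustration, not taken from the appendix).
clade_a = ["GHVDHGKT", "GHVDHGKS", "GHVDHGKA"]
clade_b = ["GHIDAGKT", "GHVDSGKT", "GQVDHGKT"]
for position, h_a, h_b in rate_shift_candidates(clade_a, clade_b):
    print(position, h_a, h_b)
```

Positions flagged in this way are only candidates; locating them on a three-dimensional structure is what turns them into testable hypotheses about function.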

(e)  Naturally  structured  protein  sequence  databases.  After  all  of  the  genomes  of  all  of 
the organisms on Earth are sequenced, all of the protein sequences almost certainly will be
recognizable as being members of one of fewer than 10^ protein families.  A naturally
structured  database  reflects  this  fact,  organizing  sequences  according  to  their  natural 
history.  This  organizational  principle  is  exploited  by  Hovergen  (Duret  et  al,  1994),  COG 
(Tatusov  et  al,  1997),  DOMO  (Gracy  and  Argos,  1998),  Pfam  (Bateman  et  al,  2000),  and 
the  Master  Catalog  (Benner  et  al,  2000). 

Stories  that  combine  part  or  all  of  these  prescriptions  are  now  emerging  for  many 
specific  cases.  These  include:  Myc  and  transferrins  (Gu,  1999),  elongation  factors 
(Lockhart  et  al,  1998;  Moreira  et  al,  1999;  Gaucher  et  al,  2001),  ribonucleases  (Jermann 
et  al,  1995),  opsins  (Chang  and  Donoghue,  2000),  globins  (Naylor  and  Gerstein,  2000), 
and  lysozymes  (Messier  and  Stewart,  1997).  The  ultimate  goal,  however,  will  be  to  join 
these  specific  cases  into  a  unified  model  that  combines  the  molecular  history  of  life  on 
Earth with the record from natural history (Benner, 2001).  Such a large-scale analysis will
incorporate dates in the past, places on the globe, and events in the molecular, geological,
and  paleontological  records,  in  a  way  that  connects  genes  and  proteins,  their  host 
organisms,  and  their  ecosystems  set  in  a  planetary  context. 


APPENDIX
MULTIPLE SEQUENCE ALIGNMENT OF ELONGATION FACTORS ANALYZED IN CHAPTER 1


10         20         30         40         50         60

I  I  I  I  I  I

AGRTU  AKSKFERNKPHVNIGTIGHVDHGKTSLTAAITKYF GEFKAYDQIDAAPEEK

ANANI  ARAKFERTKPHANIGTIGHVDHGKTTLTAAITTVLAKA- GMAKARAYADIDAAPEEK

AQUPY  AKEKFERTKEHVNVGTIGHVDHGKSTLTSAITCVLAAGLVEGGKAKCFKYEEIDKAPEEK

BACFR  AKEKFERTKPHVNIGTIGHVDHGKTTLTAAITTVLAKK GLSELRSFDSIDNAPEEK

BACST  AKAKFERTKPHVNIGTIGHVDHGKTTLTAAITTVLAKQ GKAEAKAYDQIDAAPEER

BACSU  AKEKFDRSKSHANIGTIGHVDHGKTTLTAAITTVLHKK SGKGTAMAYDQIDGAPEER

BRELN  AKASFERTKPHVNIGTIGHVDHGKTTLTAAITKVLADQ--YPDLNEARAFDQVDNAPEEK

BURCE  AKGKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLTKK FGGEAKAYDQIDAAPEEK

CAMJE  AKEKFSRNKPHVNIGTIGHVDHGKTTLTAAISAVLSRR GLAELKDYDNIDNAPEEK

CORGL  AKAKFERTKPHVNIGTIGHVDHGKTTTTAAITKVLADT--YPELNEAFAFDSIDKAPEEK

CYTLY  AKETFDRSKPHLNIGTIGHVDHGKTTLTAAITTVLANA GLSELRSFDSIDNAPEEK

DEISP  AKGTFERTKPHVNVGTIGHVDHGKTTLTAAITFTAAAS DPTIEKLAYDQIDKAPEEK

FERIS  AKVTFVRTKPHMNVGTIGQIDHGKTTLTAAITKYCSFF GWADYTPYEMIDKAPEER

FLAFE  AKETFKREKPHVNIGTIGHVDHGKTTLTAAITDILSKK GLAQAKKYDEIDGAPEEK

MICLU  AKAKFERTKAHVNIGTIGHVDHGKTTLTAAISKVLYDK--YPDLNEARDFATIDSAPEER

MYCGA  AKERFDRSKPHVNIGTIGHIDHGKTTLTAAICTVLSKA GTSEAKKYDEIDAAPEEK

MYCHO  AKLDFDRSKPHVNIGTIGHVDHGKTTLTAAIATVLAKK GLAEARDYASIDNAPEEK

MYCLE  AKAKFERTKPHVNIGTIGHVDHGKTTLTAAITKVLHDK--FPNLNESRAFDQIDNAPEER

NEIGO  AKEKFERSKPHVNVGTIGHVDHGKTTLTAALTTILAKK FGGAAKAYDQIDNAPEEK

PLAROA  AAKAKLERTKPHMNIGTIGHIDHGKTTLTAAITKVLHDR-YPELNKATPFDKIDKAPEEK

PLAROB  AKAKFERTKPHMNIGTIGHIDHGKTTLTAAITKVLHDR--YPELNKATPFDKIDKAPEEK

RICPR  AKAKFERTKPHVNIGTIGHVDHGKTSLTAAITIILAKT GGAKATAYDQIDAAPEEK

SALTY  SKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKT YGGAARAFDQIDNAPEEK

SHEPU  AKAKFERIKPHVNVGTIGHVDHGKTTLTAAISHVLAKT YGGEAKDFSQIDNAPEER

SPIPL  ARAKFERNKPHVNIGTIGHVDHGKTTLTAAITMTLAAS GGAKARKYDDIDAAPEEK

STIAU  AKEKFERNKPHVNIGTIGHVDHGKTSLTAAITKVLAKT GGATFLAYDQIDKAPEER

STRAU  AKAKFERTKPHVNIGTIGHIDHGKTTLTAAITKVLHDK--YPDLNAASAFDQIDKAPEER

STRCJ  AKAKFERTKPHVNIGTIGHVDHGKTTLTAAITKVLHDA--IPDLNPFTPFDEIDKAPEER

STROR  AKEKYDRSKPHVNIGTIGHVDHGKTTLTAAITTVLARR-LPSAVNQPKDYASIDAAPEER

TAXOC  AKETFDRSKPHVNIGTIGHVDHGKTTLTAAITTVLANK GLAAKRDFSSIDNAPEEK

THEAQ  AKGEFIRTKPHVNVGTIGHVDHGKTTLTAALTYVAAAE NPNVEVKDYGDIDKAPEER

THEMA  AKEKFVRTKPHVNVGTIGHIDHGKSTLTAAITKYLSLK VLAQYIPYDQIDKAPEEK

THETH  AKGEFVRTKPHVNVGTIGHVDHGKTTLTAALTYVAAAE NPNVEVKDYGDIDKAPEER

THICU  AKSKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLSSK FGGEAKAYDQIDAAPEEK

TREHY  AKGTYEGNKTHVNVGTIGHVDHGKTTLTSAITAVSSAM--FPATVQKVAYDSVAKASESQ

UREUR  AKAKFERTKPHVNIGTIGHVDHGKTTLTAAISTVLAKK GQAIAQSYADVDKTPEER

WOLSU  AKKKFVKYKPHVNIGTIGHVDHGKTTLSAAISAVLATK GLCELKDYDAIDNAPEER

AQUAE1  AKEKFERTKEHVNVGTIGHVDHGKSTLTSAITCVLAAGLVEGGKAKCFKYEEIDKAPEEK

AQUAE2  AKEKFERTKEHVNVGTIGHVDHGKSTLTSAITCVLAAGLVEGGKAKCFKYEEIDKAPEEK

BBUR  AKEVFQRTKPHMNVGTIGHVDHGKTTLTAAISIYCSKL NKDAKALKYEDIDNAPEEK

ECOLI2  SKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKT YGGAARAFDQIDNAPEEK

ECOLI1  SKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKT YGGAARAFDQIDNAPEEK

HINF  SKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKH YGGAARAFDQIDNAPEEK

HPYL  AKEKFNRTKPHVNIGTIGHVDHGKTTLSAAISAVLSLK GLAEMKDYDNIDNAPEEK

MGEN  AREKFDRSKPHVNVGTIGHIDHGKTTLTAAICTVLAKE GKSAATRYDEIDKAPEEK

MPNEU  AREKFDRSKPHVNVGTIGHIDHGKTTLTAAICTVLAKE GKSAATRYDQIDKAPEEK

MTUB  AKAKFQRTKPHVNIGTIGHVDHGKTTLTAAITKVLHDK--FPDLNETKAFDQIDNAPEER

SYNP3  ARAKFERTKDHVNIGTIGHVDHGKTTLTAAITMTLAEL GGAKARKYEDIDAAPEEK

SYNP7  ARAKFERTKPHANIGTIGHVDHGKTTLTAAITTVLAKA GMAKARAYADIDAAPEEK

TPAL  AKEKFARTKVHMNVGTIGHVDHGKTTLSAAITSYCAKK FGDKQLKYDEIDNAPEEK

GLUPL  GGGGGAEEKPILNVCFIGHVDSGKSTTVGNLAFQLGAIKMDKLKKEAEERGRMDMSAAER

METVA  GGGGGAKTKPILNVAFIGHVDAGKSTTVGRLLLDGGAILIVRLRKEAEEKGKMDGLKEER

SULAC  GGGGGGSQKPHLNLIVIGHVDHGKSTLIGRLLMDRGFITVKEAEEAAKKLGKMDRLKEER


70        80         90        100        110        120 

I  I  I  I  I  I 

AGRTU  ARGITISTAHVEYETPARHYAHVDCPGHADYVKNMITGAAEMDGAILVCSAADGPMPQTR

ANANI  ARGITINTAHVEYETGNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

AQUPY  ERGITINITHVEYETAKRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTR

BACFR  ERGITINTSHVEYETANRHYAHVDCPGHADYVKNMVTGAAQMDGAIIVVAATDGPMPQTR

BACST  ERGITISTAHVEYETEARHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

BACSU  ERGITISTAHVEYETETRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

BRELN  ERGITINVSHVEYQTEKRHYAHVDAPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

BURCE  ARGITINTAHVEYETANRHYAHVDCPGHADYVKNMITGAAQMDGAILVCSAADGPMPQTR

CAMJE  ERGITIATSHIEYETDNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

CORGL  ERGITINISHVEYQTEKRHYAHVDAPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTR

CYTLY  ERGITINTSHVEYSTANRHYAHVDCPGHADYVKNMVTGAAQMDGAILVVAATDGPMPQTR

DEISP  ARGITINTAHVEYNTPTRHYSHVDCPGHADYVKNMITGAAQMDGAILVVSSADGPMPQTR

FERIS  ARGITINITHVEYQTEKRHYAHIDCPGHADYIKNMITGAAQMDGAILVLAATDGPMPQTR

FLAFE  ERGITINTAHVEYETANRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAASDGPMPQTK

MICLU  QRGITINISHVEYQTEKRHYAHVDAPGHADYIKNMITGAAQMDGAILVVAATDGPMAQTR

MYCGA  ARGITINTAHVEYATQNRHYAHVDCPGHADYVKNMITGAAQMDGGILVVSATDGPMPQTR

MYCHO  ARGITINTSHIEYQTEKRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

MYCLE  QRGITINISHVEYQTEKRHYAHVDAPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTR

NEIGO  ARGITINTSHVEYETETRHYAHVDCPGHADYVKNMITGAAQMDGAILVCSAADGPMPQTR

PLAROA  ARGITISIAHVEYQTEKRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTK

PLAROB  ARGITISIAHVEYQTEKRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATAGPMPQTK

RICPR  ERGITISTAHVEYETQNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

SALTY  ARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

SHEPU  ERGITINTSHIEYDTPSRHYAHVDCPGHADYVKNMITGAAQMDGAILVVASTDGPMPQTR

SPIPL  QRGITINTAHVEYETEQRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

STIAU  ERGITISTAHVEYQTKNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

STRAU  QRGITISIAHVEYQTEARHYAHVDCPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTK

STRCJ  QRGITISIAHVEYQTESRHYAHVDCPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTK

STROR  ERGITINTAHVEYETEKRHYAHIDAPGHADYVKNMITGAAQMDGAILVVASTDGPMPQTR

TAXOC  ERGITINTAHVEYSTANRHYAHVDCPGHADYVKNMVTGAAQMDGAILVVAATDGPMPQTR

THEAQ  ARGITINTAHVEYETAKRHYSHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTR

THEMA  ARGITINITHVEYETEKRHYAHIDCPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTR

THETH  ARGITINTAHVEYETAKRHYSHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTR

THICU  ARGITINTAHVEYETANRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

TREHY  GRRLTIATSHVEYESDNRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSAEDGVMPQTK

UREUR  ERGITINASHVEYETKTRHYAHVDCPGHADYVKNMITGAAQMDGAILVIAASDGVMAQTK

WOLSU  ERGITIATSHIEYETENRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

AQUAE1  ERGITINITHVEYETAKRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTR

AQUAE2  ERGITINITHVEYETAKRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSAADGPMPQTR

BBUR  ARGITINARHIEYETANRHYAHVDCPGHADYIKNMITGAAQMDAAILLVAADSGAEPQTK

ECOLI2  ARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

ECOLI1  ARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

HINF  ARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTR

HPYL  ERGITIATSHIEYETENRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

MGEN  ARGITINSAHVEYSSDKRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSATDSVMPQTR

MPNEU  ARGITINSAHVEYSSDKRHYAHVDCPGHADYIKNMITGAAQMDGAILVVSATDSVMPQTR

MTUB  QRGITINIAHVEYQTDKRHYAHVDAPGHADYIKNMITGAAQMDGAILVVAATDGPMPQTR

SYNP3  ARGITINTAHVEYETDSRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

SYNP7  ARGITINTAHVEYETGNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTR

TPAL  ARGITINTRHLEYQSDRRHYAHIDCPGHADYVKNMITGAAQMDGGILVVSAPDGVMPQTK

GLUPL  ERGITITTSLMKLETSKHMLNVIDCPGHQDFIKNMVTGAAQADVGVVLVPCASCISGTLK

METVA  ERGVTIDVAHKKFPTAKYEVTIVDCPGHRDFIKNMITGASQADAAVLVVNVDSGIQPQTR

SULAC  ERGVTINLSFMRFETRKYFFTVIDAPGHRDFVKNMITGASQADAAILVVSAKAGMSAQTR


130        140        150        160        170       180 

I  I  I  I  I  I 

AGRTU  EHILLARQVGVPAIVVFLNKVDQVDDAELLELVELEVRELLSSYDFPGDDIPIIKGSALA 

ANANI  EHILLAKQVGVPNIVVFLNKEDMVDDAELLELVELEVRELLSSYDFPGDDIPIVAGSALQ 

AQUPY  EHVLLARQVNVPYIVVFMNKCDMVDDEELLELVELEVRELLSKYEYPGDEVPVIRGSALG

BACFR  EHILLARQVNVPKLVVFMNKCDMVEDAEMLELVEMEMRELLSFYDFDGDNTPIIQGSALG

BACST  EHILLSRQVGVPYIVVFLNKCDMVDDEELLELVEMEVRDLLSEYDFPGDEVPVIKGSALK

BACSU  EHILLSKNVGVPYIVVFLNKCDMVDDEELLELVEMEVRDLLSEYDFPGDDVPVVKGSALK

BRELN  EHVLLARQVGVPYIVVALNKSDMVDDEELLELVEFEVRDLLSSQDFDGDNAPVIPVSALK 

BURCE  EHILLARQVGVPYIIVFLNKCDSVDDAELLELVEMEVRELLSKYDFPGDDTPIVKGSAKL 

CAMJE  EHILLSRQVGVPYIVVFMNKADMVDDAELLELVEMEIRELLSSYDFPGDDTPIISGSALK 

CORGL  EHVLLARQVGVPYILVALNKCDMVEDEEIIELVEMEVRELLAEQDYD-EEAPIVHISALK 

CYTLY  EHILLGRQVGIPRIVVFLNKVDMVDDEELLELVEMEVRELLSFYEYDGDNGPVVSGSALG 

DEISP  EHILLARQVGVPYIVVFMNKVDMVDDEELLELVEMEVRELLSKYEFPGDDLPVIKGSALQ

FERIS  EHVLLARQVNVPAMIVFINKVDMV-DPELVDLVEMEVRDLLSKYEFPGDEVPVVRGSALK

FLAFE  EHILLAAQVGVPKMVVFLNKVDLVDDEELLELVEIEVREELTKRGFDGDNTPIIKGSATG 

MICLU  EHVLLARQVGVPALLVALNKSDMVEDEELLERVEMEVRQLLSSRSFDVDEAPVIRTSALK 

MYCGA  EHILLARQVGVPKMVVFLNKCDVADDPEMQELVEMEVRDLLKSYGFDGDNTPVIRGSALG 

MYCHO  EHILLARQVGVPKIVVFLNKIDMFKDDEMVGLVEMDVRSLLSEYGFDGDNAPIIAGSALK 

MYCLE  EHVLLARQVGVPYILVALNKSDAVDDEELLELVEMEVRELLAAQEFD-EDAPVVRVSALK

NEIGO  EHILLARQVGVPYIIVFMNKCDMVDDAELFQLVEMEIRDLLSSYDFPGDDCPIVQGSALK 

PLAROA  EHVLLARQVGVPYIVVALNKADMVDDEEILELVELEVRELLSAQEFPGDDLPVVRVSALK

PLAROB  EHVLLARQVGVPYIVVALNKADMVDDEEILELVELEVRELLSAQEFPGDDLPVVRVSALK

RICPR  EHILLAKQVGVPAMVVFLNKVDMVDDPDLLELVEMEVRELLSKYGFPGNEIPIIKGSALQ 

SALTY  EHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALK 

SHEPU  EHILLSRQVGVPFIIVFMNKCDMVDDEELLELVEMEVRELLSEYDFPGDDLPVIQGSALK 

SPIPL  EHILLAKQVGVPSIVVFLNKADMVDDEELLELVELEVRELLSSYDFPGDDIPIVSGSALK

STIAU  EHILLARQVGVPYIVVFLNKVDMLDDPELRELVEMEVRDLLKKYEFPGDSIPIIPGSALK 

STRAU  EHVLLARQVGVPYIVVALNKADMVDDEEILELVELEVRELLSEYDFPGDDLPVVQVSALK

STRCJ  EHVLLARQSGVPYIVVALNKADMVDDEEIMELVELEVRELLSEYEFDGDNCPVVQVSALK

STROR  EHILLSRQVGVKHLIVFMNKIDLVDDEELLELVEMEIRDLLSEYDFPGDDLPVIQGSALK 

TAXOC  EHILLARQVGVPQLVVFMNKVDMVDDPELLELVEMEIRELLSFYDFDGDNIPVVQGSALG

THEAQ  EHILLARQVGVPYIVVFMNKVDMVDDPELLDLVEMEVRDLLNQYEFPGDEVPVIRGSALL 

THEMA  EHVLLARQVEVPYMIVFINKTDMVDDPELIDLVEMEVRDLLSQYGYPGDEVPVIRGSALK 

THETH  EHILLARQVGVPYIVVFMNKVDMVDDPELLDLVEMEVRDLLNQYEFPGDEVPVIRGSALL 

THICU  EHILLARQVGVPYIIVFLNKCDMVDDAELLELVEMEVRELLSKYDFPGDDTPIIKGSAKL 

TREHY  EHVLLSRQVGVNYIVVFLNKCDKLDDPEMAEIVEAEVIDVLDHYGFDGSKTPIIRGSAIK 

UREUR  EHILLARQVGVPKIVVFLNKCDFMTDPDMQDLVEMEVRELLSKYGFDGDNTPVIRGSGLK 

WOLSU  EHILLSRQVGVPYIVVFLNKEDMVDDAELLELVEMEVRELLSNYDFPGDDTPIVAGSALK 

AQUAE1  EHVLLARQVNVPYIVVFMNKCDMVDDEELLELVELEVRELLSKYEYPGDEVPVIRGSALG

AQUAE2  EHVLLARQVNVPYIVVFMSKCDMVDDEELLELVELEVRELLSKYEYPGDEVPVIRGSALG

BBUR  EHLLLAQRMGIKKIIVFLNKLDLA-DPELVELVEVEVLELVEKYGFS-ADTPIIKGSAFG 

ECOLI2  EHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALK 

ECOLI1  EHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALK 

HINF  EHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALQ 

HPYL  EHILLSRQVGVPHIVVFLNKQDMVDDQELLELVEMEVRELLSAYEFPGDDTPIVAGSALR 
MGEN  EHILLARQVGVPKMVVFLNKCDIASDEEVQELVAEEVRDLLTSYGFDGKNTPIIYGSALK

MPNEU  EHILLARQVGVPRMVVFLNKCDIATDEEVQELVAEEVRDLLTSYGFDGKNTPIIYGSALK 

MTUB  EHVLLARQVGVPYILVALNKADAVDDEELLELVEMEVRELLAAQEFD-EDAPVVRVSALK 

SYNP3  EHILLAKQVGVPKLVVFLNKKDMVDDEELLELVELEVRELLSDYDFPGDDIPIVAGSALK 

SYNP7  EHILLAKQVGVPNIVVFLNKEDMVDDAELLELVELEVRELLSSYDFPGDDIPIVAGSALQ 

TPAL  EHLLLARQVGVPSIIVFLNKVDLVDDPELLELVEEEVRDALAGYGFS-RETPIVKGSAFK 

GLUPL  DHIMISGVLGCRKLIVCVNKVDTIDEKNRISRFDEVAKEMKGIIAKSDKDPIIIPISGYL 

METVA  EHVFLIRTLGVRQLAVAVNKMDTFSEADYNELKKMIGDQLLKMIGFNPEQINFVPVASLH 

SULAC  EHIILSKTMGINQVIVAINKMDLYDEKRFKEIVDTIVSKFMKSFGFDMNKVKFVPVVAPD 


190       200       210       220        230       240 

I  I  I  I  I  I 

AGRTU  ALEDSDK KIGEDAIRELMAAVDAYIPTPERPIDQPFLMPIEDVFSISGRGTV 

ANANI  ALEAIQGGASGQKGDNPWVDKILKLMEEVDAYIPTPEREVDRPFLMAVEDVFTITGRGTV 

AQUPY  ALQELEQNSPGK WVGSIKELLNAMDEYIPTPEREVDKPFLMPIEDVFSISGRGTV 

BACFR  ALNGVEK WEDKVMELMEAVDTWIPLPPRDVDKPFLMPVEDVFSITGRGTV 

BACST  ALEGDPK WEEKIIELMNAVDEYIPTPQREVDKPFMMPIEDVFSITGRGTV 

BACSU  ALEGDAE WEAKIFELMDAVDEYIPTPERDTEKPFMMPVEDVFSITGRGTV 

BRELN  ALEGDEK WVKSVQDLMAAVDDNVPEPERDVDKPFLMPVEDVFTITGRGTV 

BURCE  ALEGDTGE LGEVAIMSLADALDTYIPTPERAVDGAFLMPVEDVFSISGRGTV 

CAMJE  ALEEAKAGQDGE WSAKIMDLMAAVDSYIPTPTRDTEKDFLMPIEDVFSISGRGTV 

CORGL  ALEGDEK WGKQILELMQACDDNIPDPVRETDKPFLMPIEDIFTITGRGTV 

CYTLY  ALNGEQK WVDTVMELMEAVDNWIELPKRDVDKDFLMPVEDVFTITGRGTV 

DEISP  ALEALQANPKTARGEDKWVDRIWELLDAVDSYIPTPERATDKTFLMPVEDVFTITGRGTV 

FERIS  AIEAPNDPNDP AYKPIKELLDAMDTYFPDPVREVDKPFLMPIEDVFSITGRGTV 

FLAFE  ALAGEEK WVKEIENLMDAVDSYIPLPPRPVDLPFLMSVEDVFSITGRGTV 

MICLU  ALEGDPQ WVKSVEDLMDAVDEYIPDPVRDKDKPFLMPIEDVFTITGRGTV 

MYCGA  ALNGEPA WEEKIHELMKAVDEYIPTPDREVDKPFLLPIEDTMTITGRGTV 

MYCHO  ALQGDPE YEKGILELMDAVDTYIEEPKRETDKPFLMAVEDVFTITGRGTV 

MYCLE  ALEGDAK WVESVTQLMDAVDESIPAPVRETDKPFLMPVEDVFTITGRGTV 

NEIGO  ALEGDAA YEEKIFELATALDRYIPTPERAVDKPFLLPIEDVFSISGRGTV 

PLAROA  ALEGDEK WADSIIELMNAVDENIPEPPRDTDKPFLMPIEDVFSITGRGTV 

PLAROB  ALEGDEK WADSIIELMNAVDENIPEPPRDTDKPFLMPIEDVFSITGRGTV 

RICPR  ALEGKPE GEKAINELMNAVDTYIPQPIELQDKPFLMPIEDVFSISGRGTV 

SALTY  ALEGDAE WEAKIIELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTV 

SHEPU  ALEGEPE WEAKILELAAALDSYIPEPQRDIDKPFLLPIEDVFSISGRGTV 

SPIPL  ALDFLTENPKTTRGENDWVDKIHALMDEVDAYIPTPERDIDKGLLDGLEDVFSITGRGTV 

STIAU  ALEGDTS DIGEGAILKLMAAVDEYIPTPQRATDKPFLMPVEDVFSIAGRGTV 

STRAU  ALEGDKE WGDKLLGLMDAVDEAIPTPPRDTDKPFLMPVEDVFTITGRGTV 

STRCJ  ALEGDKE WGEKLLGLMKAVDENIPQPERDVDKPFLMPIEDVFTITGRGTV 

STROR  ALEGDSK YEDIIMELMNTVDEYIPEPERDTEKPLLLPVEDVFSITGRGTV 

TAXOC  GLNGDAK WVGTIEQLMDSVDNWIPIPPRLTDQPFLMPVEDVFSITGRGTV 

THEAQ  ALEEMHKNPKTKRGENEWVDKIWELLDAIDEYIPTPVRDVDKPFLMPVEDVFTITGRGTV 

THEMA  AVEAPNDPN HEAYKPIQELLDAMDNYIPDPQRDVDKPFLMPIEDVFSITGRGTV 

THETH  ALEQMHRNPKTRRGENEWVDKIWELLDAIDEYIPTPVRDVDKPFLMPVEDVFTITGRGTV 

THICU  ALEGDKG ELGEGAILKLAEALDTYIPTPERAVDGAFLMPVEDVFSISGRGTV 

TREHY  AIQAIEAGKDPR--TDPDCKCILDLLNALDTYIPDPVREVDKDFLMSIEDVYSIPGRGTV

UREUR  ALEGDPV WEAKIDELMDAVDSWIPLPERSTDKPFLLAIEDVFTISGRGTV 

WOLSU  ALEEANDQENVG EWGEKVLKLMAEVDRYIPTPERDVDKPFLMPVEDVFSIAGRGTV

AQUAE1  ALQELEQNSP GKWVESIKELLNAMDEYIPTPQREVDKPFLMPIEDVFSISGRGTV 

AQUAE2  ALQELEQNSP GKWVESIKELLNAMDEYIPTPQREVDKPFLMPIEDVFSISGRGTV 

BBUR  AMSNPED PESTKCVKELLESMDNYFDLPERDIDKPFLLAVEDVFSISGRGTV 

ECOLI2  ALEGDAE WEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTV 

ECOLI1  ALEGDAE WEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTV 

HINF  ALNGVAE WEEKILELANHLDTYIPEPERAIDQPFLLPIEDVFSISGRGTV 

HPYL  ALEEAKAGNVG EWGEKVLKLMAEVDAYIPTPERDTEKTFLMPVEDVFSIAGRGTV 
MGEN       ALEGDPK WEAKIHDLIKAVDEWIPTPTREVDKPFLLAIEDTMTITGRGTV

MPNEU      ALEGDPK WEAKIHDLMNAVDEWIPTPEREVDKPFLLAIEDTMTITGRGTV 

MTUB       ALEGDAK WVASVEELMNAVDESIPDPVRETDKPFLMPVEDVFTITGRGTV 

SYNP3      AIEGEKE YKDAILELMKAVDDYIDTPEREVDKPFLMAVEDVFSITGRGTV 

SYNP7  ALEAIQGGASGQKGDNPWVDKILKLMEEVDAYIPTPEREVDRPFLMAVEDVFTITGRGTV 

TPAL       ALQDGAS PEDAACIEELLAAMDSYFEDPVRDDARPFLLSIEDVYTISGRGTV 

GLUPL      GINIVEKGDKFE WFKGWKTLEGALNSQIPPPPRPIDKPLRMPIDSIHKIPGIGMV 

METVA      GDNVFKKSERNP WYKGPKTIAEVIDGFQPPPEKPTNLPLRLPIQDVYTITGVGTV 

SULAC      GDNVTHKSTKMP WYNGPKTLEELLDQLEIPPPKPVDKPLRIPIQEVYSISGVGVV

250        260       270       280        290        300 
I  I  I  I  I  I 

AGRTU  VTGRVERGIVKVGEEVEIVGIR-PTSKTTVTGVEMFRKLLDQGQAGDNIGALVRGVTRDG 

ANANI  ATGRIERGSVKVGETIEIVGLR-DTRSTTVTGVEMFQKTLDEGLAGDNVGLLLRGIQKTD 

AQUPY  VTGRVERGVLRPGDEVEIVGLREEPLKTVATSIEMFRKVLDEALPGDNIGVLLRGVGKDD 

BACFR  ATGRIETGVIHVGDEIEILGLG-EDKKSVVTGVEMFRKLLDQGEAGDNVGLLLRGVDKNE 

BACST  ATGRVERGTLKVGDPVEIIGLSDEPKATTVTGVEMFRKLLDQAEAGDNIGALLRGVSRDE 

BACSU  ATGRVERGQVKVGDEVEIIGLQEENKKTTVTGVEMFRKLLDYAEAGDNIGALLRGVSREE 

BRELN  VTGRVERGVLLPNDEIEIVGIKEKSSKTTVTAIEMFRKTLPDARAGENVGLLLRGTKRED 

BURCE  VTGRVERGIVKVGEEIEIVGIK-PTVKTTCTGVEMFRKLLDQGQAGDNVGILLRGTKRED 

CAMJE  VTGRIEKGVVKVGDTIEIVGIK-DTQTTTVTGVEMFRKEMDQGEAGDNVGVLLRGTKKEE

CORGL  VTGRVERGTLNVNDDVDIIGIKEKSTSTTVTGIEMFRKLLDSAEAGDNCGLLLRGIKRED 

CYTLY  ATGRIETGVANTGDAVDIIGMGADKLASTITGVEMFRKILDRGEAGDNVGILLRGIEKSQ 

DEISP  ATGRVERGVVKVQDEVEIIGLR-DTKKTTVTGIEMHRKLLDSGMAGDNVGVLLRGVARDD

FERIS  VTGRIERGVIKPGVEAEIIGMSYEIKKTVITSVEMFRKELDEAIAGDNVGCLLRGSSKDE 

FLAFE  ATGRIERGRIKVGEPVEIVGLQESPLNSTVTGVEMFRKLLDEGEAGDNAGLLLRGVEKTQ 

MICLU  VTGRAERGTLKINSEVEIVGIR-DVQKTTVTGIEMFHKQLDEAWAGENCGLLVRGLKRDD 

MYCGA  VTGRVERGQLKVGEEVEIVGIT-DTRKVVVTGIEMFRKELDAAMAGDNAGILLRGVDRKD

MYCHO  ATGRVERGVLQLNEEVEIVGLK-PTKKTVVTGIEMFRKNLKEAQAGDNAGLLLRGIDRSE

MYCLE  VTGRVERGVVNVNEEVEIVGIRQTTTKTTVTGVEMFRKLLDQGQAGDNVGLLLRGIKRED

NEIGO  VTGRVERGIIHVGDEIEIVGLK-ETQKTTCTGVEMFRKLLDEGQAGDNVGVLLRGTKRED

PLAROA  VTGRIERGVVKVNEQVDIIGIKSEKTTTTVTSIEMFNKMLDEGHAGDNAALLLRGIKREQ

PLAROB  VTGRIERGVVKVNEQVDIIGIKSEKTTTTVTSIEMFNKMLDEGHAGDNAALLLRGIKREQ 

RICPR  VTGRVESGIIKVGEEIEIVGLK-NTQKTTCTGVEMFRKLLDEGQSGDNVGILLRGTKREE 

SALTY  VTGRVERGIIKVGEEVEIVGIK-ETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREE 

SHEPU  VTGRVERGIVRVGDEVEIVGVR-ATTKTTCTGVEMFRKLLDEGRAGENCGILLRGTKRDD 

SPIPL  STAGIERGKVKVGDTVELIGIK-DTRTTTVTGAEMFQKTLEEGMAGDNVGLLLRGIQKND 

STIAU  ATGRVERGKIKVGEEVEIVGIR-PTQKTVITGVEMFRKLLDEGMAGDNIGALLRGLKRED 

STRAU  VTGRIERGVLKVNETVDIIGIKTEKTTTTVTGIEMFRKLLDEGQAGENVGLLLRGIKRED 

STRCJ  VTGRIERGVLKVNETVDIIGIKTEKTTTTVTGIEMFRKLLDEGQAGENVGLLLRGIKRED 

STROR  ASGRIDRGTVRVNDEIEIVGIKEETQKAVVTGVEMFRKQLDEGLAGDNVGVLLRGVQRDE 

TAXOC  ATGRIERGVINSGEPVEILGMGAENLKSTVTGVEMFRKILDRGEAGDNVGLLLRGIEKEA 

THEAQ  ATGRIERGKVKVGDEVEIVGLAPETRKTVVTGVEMHRKTLQEGIAGDNVGLLLRGVSREE 

THEMA  VTGRIERGRIRPGDEVEIIGLSYEIKKTVVTSVEMFRKELDEGIAGDNVGCLLRGIDKDE 

THETH  ATGRIERGKVKVGDEVEIVGLAPETRRTVVTGVEMHRKTLQEGIAGDNVGVLLRGVSREE 

THICU  VTGRVERGIIKVGEEIEIVGLK-PTLKTTCTGVEMFRKLLDQGQAGDNVGILLRGTKREE 

TREHY  VTGRIERGKIEKGNEVEIVGIR-PTQKTTCTGVEMFKKEVV-GIAGYNVGCLLRGIERKA 

UREUR  VTGRVERGVLKVNDEVEIVGLK-DTQKTVVTGIEMFRKSLDQAEAGDNAGILLRGIKKED

WOLSU  VTGRIERGVVKVGDEVEIVGIR-NTQKTTVTGVEMFRKELDKGEAGDNVGVLLRGTKKED

AQUAE1  VTGRVERGVLRPGDEVEIVGLREEPLKTVATSIEMFRKVLDEALPGDNIGVLLRGVGKDD

AQUAE2  VTGRVERGVLRPGDEVEIVGLREEPLKTVATSIEMFRKVLDEALPGDNIGVLLRGVGRDD

BBUR  ATGRIERGIIKVGQEVEIVGIK-ETRKTTVTGVEMFQKILEQGQAGDNVGLLLRGVDKKD 

ECOLI2  VTGRVERGIIKVGEEVEIVGIK-ETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREE 

ECOLI1  VTGRVERGIIKVGEEVEIVGIK-ETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREE 

HINF  VTGRVERGIIRTGDEVEIVGIK-DTAKTTVTGVEMFRKLLDEGRAGENIGALLRGTKREE 

HPYL  VTGRIERGVVKVGDEVEIVGIR-PTQKTTVTGVEMFRKELEKGEAGDNVGVLLRGTKKEE


MGEN  VTGRVERGELKVGQEVEIVGLK-PIRKAVVTGIEMFKKELDSAMAGDNAGVLLRGVERKE

MPNEU  VTGRVERGELKVGQEIEIVGLR-PIRKAVVTGIEMFKKELDSAMAGDNAGVLLRGVDRKE

MTUB  VTGRVERGVINVNEEVEIVGIRPSTTKTTVTGVEMFRKLLDQGQAGDNVGLLLRGVKRED 

SYNP3  ATGRIERGKVKVGEEISIVGIK-DTRKATVTGVEMFQKTLEEGMAGDNVGLLLRGIQKED 

SYNP7  ATGRIERGSVKVGETIEIVGLR-DTRSTTVTGVEMFQKTLDEGLAGDNVGLLLRGIQKTD 

TPAL  VTGRIECGVISLNEEVEIVGIK-PTKKTVVTGIEMFNKLLDQGIAGDNVGLLLRGVDKKE

GLUPL  YTGRVSTGAIKPGMVEIVVSSQPTGVVAEVKTLEIHKQSRAAVVSGENCGVALKAASQGN

METVA  PVGRVETGIIKPGDVEKVVFEPPAGAIGEIKTVEMHHEQLPSAEPGDNIGFNVRGVGKKD

SULAC  PVGRIESGVLKVGDVEKIVFMPPVGKIGEVRSIETHHTKIDKAEPGDNIGFNVRGVEKKD 


310        320       330        340        350        360 

I  I  I  I  I  I 

AGRTU  VERGQILCKPGSVKPHKKFMAEAYILTKEEGGRHTPFFTNYRPQFYFRTTDVTG-IVSLP 

ANANI  IERGMVLAKPGSITPHTKFESEVYVLKKEEGGRHTPFFPGYRPQFYVRTTDVTGAISDFT 

AQUPY  VERGQVLAQPGSVKAHRKFRAQVYVLSKEEGGRHTPFFVNYRPQFYFRTADVTGTVVKLP

BACFR  IKRGMVLCKPGQIKPHSKFKAEVYILKKEEGGRHTPFHNKYRPQFYLRTMDCTG-EITLP 

BACST  VERGQVLAKPGSITPHTKFKAQVYVLTKEEGGRHTPFFSNYRPQFYFRTTDVTG-IITLP 

BACSU  IQRGQVLAKPGTITPHSKFKAEVYVLSKEEGGRHTPFFSNYRPQFYFRTTDVTG-IIHLP

BRELN  VERGQVIVKPGSITPHTKFEAQVYILSKDEGGRHNPFYSNYRPQFYFRTTDVTG-VITLP 

BURCE  VERGQVLAKPGSITPHTHFTAEVYVLSKDEGGRHTPFFNNYRPQFYFRTTDVTG-SIELP 

CAMJE  VIRGMVLAKPKSITPHTDFEAEVYILNKDEGGRHTPFFNNYRPQFYVRTTDVTG-SIKLA 

CORGL  VERGQVIVKPGAYTPHTEFEGSVYVLSKDEGGRHTPFFDNYRPQFYFRTTDVTG-VVKLP 

CYTLY  ISRGMVICKPGSVKPHSKFEAEVYILKKEEGGRHTPFHNNYRPQFYVRTTDVTG-TISLP 

DEISP  VERGQVLAKPGSIKPHTKFEASVYVLSKDEGGRHSAFFGGYRPQFYFRTTDVTG-VVELP

FERIS  VERGQVLAKPGSITPLKKFKANIYVLKKEEGGRHTPFTKGYKPQFYIRTADVTGEIVDLP 

FLAFE  IRRGMVIVKPGSITPHTDFKGEVYVLSKDEGGRHTPFFNKYRPQFYFRTTDVTG-EVELN 

MICLU  VERGQVLVEPGSITPHTNFEANVYILSKDEGGRHTPFYSNYRAQFYFRTTDVTG-VITLP 

MYCGA  VQRGQVLAKPGSITPHKKFRAEIYALKKDEGGRHTAFLNGYRPQFYFRTTDVTG-SIQLK 

MYCHO  VERGQVLAKPKTIVPHTQFEATVYVLKKEEGGRHTPFFHNYKPQFYFRTTDVTG-GIEFK 

MYCLE  VERGQVVIKPGTTTPHTEFEGQVYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVTLP

NEIGO  VERGQVLAKRGTITPHTKFKAEVYVLSKEEGGPHTPFFANYRPQFYFRTTDVTG-TITLE 

PLAROA  VERGQCIIKPGTTTPHTEFEAQVYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVNLP 

PLAROB  VERGQCIIKPGTTTPHTEFEAQVYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVNLP 

RICPR  VERGQVLAKPGSIKPHDKFEAEVYVLSKEEGGRHTPFTNDYRPQFYFRTTDVTG-TIKLP 

SALTY  IERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTG-TIELP 

SHEPU  VERGQVLAKPGSINPHTTFESEVYVLSKEEGGRHTPFFKGYRPQFYFRTTDVTG-TIELP 

SPIPL  VQRGMVIAKPKSITPHTKFEAEVYILKKEEGGRHTPFFKGYRPQFYVRTTDVTGTIDEFT 

STIAU  LERGQVLANWGSINPHTKFKAQVYVLSKEEGGRHTPFFKGYRPQFYFRTTDVTG-TVKLP 

STRAU  VERGQVIIKPGSVTPHTEFEAAAYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVTLP

STRCJ  VERGQCIIKPGTVTPHTEFEATAYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVTLK

STROR  IERGQVIAKPGSINPHTKFKGEVYILTKEEGGRHTPFFNNYRPQFYFRTTDVTG-SIELP 

TAXOC  IRRGMVICKPGSVTPHKKFKAEVYVLSKEEGGRHTPFFNNYRPQFYFRTTDVTG-IISLA 

THEAQ  VERGQVLAKPGSITPHTKFEASVYILKKEEGGRHTGFFTGYRPQFYFRTTDVTG-VVRLP 

THEMA  VERGQVLAAPGSIKPHKRFKAQIYVLKKEEGGRHTPFTKGYKPQFYIRTADVTGEIVGLP 

THETH  VERGQVLAKPGSITPHTKFEASVYVLKKEEGGRHTGFFSGYRPQFYFRTTDVTG-VVQLP 

THICU  VERGQVLCKPGSIKPHTHFTAEVYVLSKDEGGRHTPFFNNYRPQFYFRTTDVTG-AIELP 

TREHY  VERGQVLAKPGTITPHKKFEAEVYILKKEEGGRHSGFVSGYRPQMYFRTTDVTG-VINLQ 

UREUR  VERGQVLVKPGSIKPHRTFTAKVYILKKEEGGRHTPIVSGYRPQFYFRTTDVTG-AISLP 

WOLSU  VERGMVLCKIGSITPHTNFEGEVYVLSKEEGGRHTPFFNGYRPQFYVRTTDVTG-SISLP 

AQUAE1  VERGQVLAQPGSVKAHKRFRAQVYVLSKEEGGRHTPFFVNYRPQFYFRTADVTGTVVKLP

AQUAE2  VERGQVLAQPGSVKAHKRFRAQVYVLSKEEGGRHTPFFVNYRPQFYFRTADVTGTVVKLP

BBUR  IERGQVLSAPGTITPHKKFKASIYCLTKEEGGRHKPFFPGYRPQFFFRTTDVTG-VVAL-

ECOLI2  IERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTG-TIELP 

ECOLI1  IERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTG-TIELP 

HINF  IERGQVLAKPGSITPHTDFESEVYVLSKDEGGRHTPFFKGYRPQFYFRTTDVTG-TIELP 

HPYL  VERGMVLCKPGSITPHKKFEGEIYVLSKEEGGRHTPFFTNYRPQFYVRTTDVTG-SITLP 
MGEN  VERGQVLAKPGSIKPHKKFKAEIYALKKEEGGRHTGFLNGYRPQFYFRTTDVTG-SIALA

MPNEU  VERGQVLAKPGSIKPHKKFKAEIYALKKEEGGRHTGFLNGYRPQFYFRTTDVTG-SISLP 

MTUB  VERGQVVTKPGTTTPHTEFEGQVYILSKDEGGRHTPFFNNYRPQFYFRTTDVTG-VVTLP 

SYNP3  IERGMVLAKPGSITPHTEFEGEVYVLKKEEGGRHTPFFANYRPQFYVRTTDVTGTIKSYT 

SYNP7  IERGMVLAKPGSITPHTKFESEVYVLKKDEGGRHTPFFPGYRPQFYVRTTDVTGAISDFT 

TPAL  VERGQVLSKPGSIKPHTKFEAQIYVLSKEEGGRHSPFFQGYRPQFYFRTTDITG-TISLP 

GLUPL  IKPGHVFSNTKDVEIFEAARAKIVVVAHPKKPGYCPTMDLGTHHVPCQITKFIS-KRMPG 

METVA  IKRGDVLGHTTNPTVATDFTAQIVVLQHPSTDGYTPVFHTHTAQIACTFAEIQK-LNPAT 

SULAC  VKRGDVAGSVQNPTVADEFTAQVIVIWHPTGVGYTPVLHVHTASIACRVSEITS-IDPKT 


370        380       390        400       409 
I  I  I  I  I 

AGRTU  EGTEMVMPGDNVTVEVELIVPIAMEEKLRFAIREGGRTVGAGIVASIVE

ANANI  ADDGMVIPGDRIKMTVELINPIAIEQGMRFAIREGGRTIGAGVVSKILQ

AQUPY  EGVEMVMPGDNVELEVELIAPVALEEGLRFAIREGGRTVGAGVVTKILD

BACFR  EGTEMVMPGDNVTITVELIYPVALNIGLRFAIREGGRTVGAGQITEIID

BACST  EGVEMVMPGDNVEMTVELIAPIAIEEGTKFSIREGGRTVGAGSVSEIIE

BACSU  EGVEMVMPGDNTEMNVELISTIAIEEGTRFSIREGGRTVGSGVVSTITE

BRELN  EGTEMVMPGDNTDMSVELIQPIAMEDRLRFAIREGGRTVGAGRVTKITA

BURCE  KDKEMVMPGDNVSITVKLIAPIAMEEGLRFAIREGGRTVGAGVVAKILD

CAMJE  DGVEMVMPGENVRITVSLIAPVALEEGTRFAIREGGKTVGSGVVSKIIK

CORGL  EGTEMVMPGDNVDMSVTLIQPVAMDEGLRFAIREGSRTVGAGRVTKIIK

CYTLY  SGVEMVMPGDNLTITVELLSPIALSEGLRFAIREGGRTVGAGQVTKIIE

DEISP  EGVEMVMPGDNITFVVELIKPIAMEEGLRFAIREGGRTVGAGVVAKVLE

FERIS  AGVEMVMPGDNVEMTIELIYPVAIEKGMRFAVREGGRTVGAGVVSEIIE

FLAFE  AGTEMVMPGDNTNLTVKLIQPIAMEKGLKFAIREGGRTVGAGQVTEILK

MICLU  EGTEMVMPGDTTEMSVELIQPIAMEEGLGFAIREGGRTVGSGRVTKITK

MYCGA  EGTEMVMPGDNTEIIVELISSIACEKGSKFSIREGGRTVGAGTVVEVLE

MYCHO  PGREMVVPGDNVELTVTLIAPIAIEEGTKFSIREGGRTVGAGSVTKILK

MYCLE  EGTEMVMPGDNTNISVTLIQPVAMDEGLRFAIREGGRTVGAGRVVKIIK

NEIGO  KGVEMVMPGENVTITVELIAPIAMEEGLRFAIREGGRTVGAGVVSSVIA

PLAROA  EGTEMVMPGDNTEMTVQLIQPIAMEEGLKFAIREGGRTVGAGRVTKILK

PLAROB  EGTEMVMPGDNTEMTVQLIQPIAMEEGLKFAIREGGRTVGAGRVTKILK

RICPR  SDKQMVMPGDNATFSVELIKPIAMQEGLKFSIREGGRTVGAGIVTKINN

SALTY  EGVEMVMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

SHEPU  EGVEMVMPGDNIKMVVTLICPIAMDEGLRFAIREGGRTVGAGVVAKIIA

SPIPL  ADDGMVIPGDRINMTVQLICPIAIEQGMRFAIREGGRTVGAGVVAKILA

STIAU  DNVEMVMPGDNIAIEVELITPVAMEKELPFAIREGGRTVGAGVVADIIA

STRAU  EGTEMVMPGDNTDMTVALIQPVAMEEGLKFAIREGGRTVGAGQVTKITK

STRCJ  EGTEMVMPGDNAEMTVNLIQPVAMEEGLRFTIREGGRTVGAGQVVKINK

STROR  AGTEMVMPGDNVTIDVELIHPIAVEQGTTFSIREGGRTVGSGMVTEIEA

TAXOC  EGVEMVMPGDNVTISVELINAVAMEKGLRFAIREGGRTVGAGQVTEILD

THEAQ  QGVEMVMPGDNVTFTVELIKPVALEEGLRFAIREGGRTVGAGVVTKILE

THEMA  EGVEMVMPGDHVEMEIELIYPVAIEKGQRFAVREGGRTVGAGVVTEVIE

THETH  PGVEMVMPGDNVTFTVELIKPVALEEGLRFAIREGGRTVGAGVVTKILE

THICU  KDKEMVMPGDNVSITVKLIAPIAMEEGLRFAIREGGRTVGAGVVAKIIE

TREHY  GDAQMIMPGDNANLTIELITPIAMEEKQRFAIREGGKTVGNGVVTKNIR

UREUR  AGVDLVMPGDDVEMTVELIAPVAIEDGSKFSIREGGKTVGHGSVIKTSN

WOLSU  EGVEMVMPGDNVKINVELIAPVALEEGTRFAIREGGRTVGAGVVTKITK

AQUAE1  EGVEMVMPGDNVELEVELIAPVALEEGLRFAIREGGRTVGAGVVTKILD

AQUAE2  EGVEMVMPGDNVELEVELIAPVALEEGLRFAIREGGRTVGAGVVTKILD

BBUR  EGKEMVMPGDNVDIIVELISSIAMDKNVEFAVREGGRTVASGRILEILE

ECOLI2  EGVEMVMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

ECOLI1  EGVEMVMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLS

HINF  EGVEMVMPGDNIKMTVSLIHPIAMDQGLRFAIREGGRTVGAGVVAKIIK

HPYL  EGVEMVMPGDNVKITVELISPVALELGTKFAIREGGRTVGAGVVSNIIE

MGEN  ENTEMVLPGDNASITVELIAPIACEKGSKFSIREGGRTVGAGTVTEVLE

MPNEU  ENTEMVLPGDNTSITVELIAPIACEKGSKFSIREGGRTVGAGSVTKCLN

MTUB  EGTEMVMPGDNTNISVKLIQPVAMDEGLRFAIREGGRTVGAGRVTKIIK

SYNP3  ADDGMVMPGDRIKMTVELINPIAIEQGMRFAIREGGRTIGAGVVSKILK

SYNP7  ADDGMVIPGDRIKMTVELINPIAIEQGMRFAIREGGRTIGAGVVSKILQ

TPAL  EGVDMVKPGDNTKIIGELIHPIAMDKGLKLAIREGGRTIASGQVTEILL

GLUPL  IKEEIPSPGENVTCIIHPQKQVVMETLLRFALRDAGRIVGIGAIEARYT

METVA  GEVLEENPGDAAIVKLIPTKPMVIESVLRFAIRDMGMTVAAGMAIQVTA

SULAC  GKEAEKNPGDSAIVKFKPIKELVAEKFLRFAMRDMGKTVGVGVIIDVKP